Energy Grid Forecasting¶


Energy forecasting is a vital part of today's energy markets and operations. It is used to set the cost of electricity, determine when power generation should be increased, and decide when power needs to be directed to or from other regions of the electrical grid. Electrical energy demand continues to rise, especially with the recent widespread adoption of electric vehicles and heat pumps. As a result, the electrical grid will need to become more robust and adaptive as this increase in demand comes in tandem with a transition away from fossil fuels and into a more diversified energy market and with it, more complex cycles. These circumstances make accurate forecasting and periodicity identification of electrical loads and generation that much more important, and it is necessary to research the differences in forecasting applications and relevance in various short-term, medium-term, and long-term settings.

Many of the statistical time series forecasting techniques of the past are still in use today, each with their own strengths and weaknesses, often trading compute efficiency with lesser accuracy in long-term forecast horizons or ability to capture multiple seasonal periods. More modern machine learning models and neural network-based architectures tend to generalize through forecasting horizons better, but often at great cost of compute resources making them inefficient for short forecast horizon forecasts. Though because of the potential for complexity of energy load patterns, both categories see benefits in task-specific tuning. By evaluating each forecasting model's accuracy across various context windows and forecasting horizons, this project presents a thorough comparison on which techniques are most relevant in the task-specific energy grid context and each forecast horizon.

Statistical, machine learning, and deep learning-based methods compared across 5 time series and 6 forecast horizon lengths.

Cody Hill

April 2024

Github Repo: https://github.com/chill0121/Energy_Grid_Forecasting

Table of Contents ¶


  • 1.Data Source Information
    • 1.1. Energy Load Data
    • 1.2. Weather Data
  • 2.Setup
    • 2.1. Environment Details for Reproducility
    • 2.2. Importing the Data
  • 3.Data Preprocessing
    • 3.1. First Looks
    • 3.2. Reformatting
    • 3.3. Discovery
    • 3.4. Selecting Electrical Load Zones
    • 3.5. Missing Data
      • 3.5.1. Missing Dates - Where Has the Time Gone?
      • 3.5.2. Missing Weather Data
    • 3.6. Outlier Investigation
    • 3.7. Merging the Datasets
  • 4.Feature Engineering
    • 4.1. Date Features
    • 4.2. Rolling Mean
    • 4.3. Lagged Features
    • 4.4. Fourier Transform
    • 4.5. First and Second-Order Derivatives/Differencing
  • 5.Exploratory Data Analysis (EDA)
  • 6.Train, Validation, and Test Split
    • 6.1. Validation Set Conundrum
  • 7.Model Setup
    • 7.1. Baseline Models
      • 7.1.1. Seasonal Naive Forecasting
      • 7.1.2. Mean
    • 7.2. Statistical Models
      • 7.2.1. Seasonal Autoregressive Integrated Moving Average (SARIMA)
      • 7.2.2. Holt-Winters' Method
      • 7.2.3. Multiple Seasonal-Trend decomposition using LOESS (MSTL)
    • 7.3. Machine Learning Models
      • 7.3.1. Support Vector Regression
      • 7.3.2. XGBoost
    • 7.4. Deep Learning Models
      • 7.4.1. Neural Hierarchical Interpolation (N-HiTS)
      • 7.4.2. Temporal Fusion Transformer
  • 8.Results Evaluation
    • 8.1. Evaluation Helper Functions
    • 8.2. Generate Results
    • 8.3. Results Tables
    • 8.4. Results Plots
      • 8.4.1. Plot Results by Model and Horizon
      • 8.4.2. Plot Forecasts vs True Values
  • Appendix A - Online References

1. Data Source Information ¶


1.1. Energy Load Data: ¶

Electrical load and generation data was gathered from Pennsylvania-New Jersey-Maryland Interconnection, now called PJM, which is a regional transmission organization (RTO) that coordinates movement of electricity in 13 states and Washington, DC.

Data Info:

  • Detailed dataset documentation can be found at https://dataminer2.pjm.com/feed/hrl_load_metered/definition.
  • Accessed and obtained on 3/28/2024.
  • Ranges from 1/1/1993 to 3/27/2024.
  • Hourly frequency.

  • Feature Information

    • Datetime Beginning UTC: Hourly datetime in UTC.
    • Datetime Beginning EPT: Hourly datetime in EPT.
    • NERC Region: North American Electric Reliability Corporation (NERC) Regions
    • Market Region: Designation of utility market data belongs to.
    • Transmission Zone: Zone where the electricity transmission occurs. Flagged by a string of the energy company(s)' name.
    • Load Area: Metered electric distribution company.
    • MW: Electrical load measured in Megawatts (MW).
    • Company Verified: Indicates whether the metered load has been verified by the electric distribution company.
Back to Table of Contents¶

1.2. Weather Data: ¶

Weather data was sourced from National Oceanic and Atmospheric Administration's (NOAA) Regional Climate Center (RCC) Applied Climate Information System (ACIS). The data was manually downloaded from https://scacis.rcc-acis.org/# by selecting a weather station and choosing relevant dates and features with the daily data listings form.

Data Info:

  • Detailed dataset documentation can be found at https://www.rcc-acis.org/docs_datasets.html.
  • Accessed and obtained on 4/10/2024
  • Daily frequency.

    • Feature Information
      • Date: Daily data with date range matching the each selected zone (selected in section 3.4.).
      • MaxTemperature: Maximum temperature recorded at that weather station that day.
      • MinTemperature: Minimum temperature recorded at that weather station that day.
      • AvgTemperature: Average of the high and low temperature.
      • Precipitation: 24-hour total precipitation amount using a rain gauge, measured in inches.
      • Zone: Electric distributor matched from other dataset.

Ideally, the sampling rate would match the hourly load data, however, only daily weather data is available at this time.

A weather station was chosen for each zone (selected in section 3.4.). The criteria considered for the weather station was:

  • Geographically central to the listed zone.
  • Weather station uptime reliability (airport weather stations often chosen for this quality.)

Weather Stations Chosen:

  • AE: MILLVILLE MUNICIPAL AIRPORT, NJ
  • CE: CHICAGO OHARE INTL AP, IL
  • DOM: RICHMOND INTERNATIONAL AP, VA
  • JC: NEW BRUNSWICK 3 SE, NJ
  • PEP: WASHINGTON REAGAN NATIONAL AIRPORT, VA

Zones

(Map Source: https://www.pjm.com/-/media/about-pjm/pjm-zones.ashx)

Back to Table of Contents¶

2. Setup ¶


In [278]:
import os
import sys
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.dates import AutoDateLocator, AutoDateFormatter

from scipy.fft import fft
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.preprocessing import StandardScaler
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

import statsmodels.api as sm
from statsforecast.models import MSTL, AutoARIMA, HoltWinters
from statsforecast.utils import ConformalIntervals
#from prophet import Prophet
from sklearn.svm import SVR
import xgboost as xgb
from neuralforecast import NeuralForecast
from neuralforecast.models import NHITS, TFT
from neuralforecast.losses.pytorch import DistributionLoss
import torch

Note:

Unable to resolve error that causes the kernel to crash when running N-HiTS when importing both statsforecast.models import MSTL, AutoARIMA, HoltWinters AND neuralforecast.models import NHITS, TFT.

In the N-HiTS section, the model can be trained, saved, and commented out. Then return the statsforecast models and use the load code block in the N-HiTS section to load predictions.

Back to Table of Contents¶

2.1. Environment Information for Reproducibility: ¶

In [279]:
print(f"Python version: {sys.version}")

packages = [pd, np, sns, sm, xgb, torch]
for package in packages:
    print(f"{str(package).partition('from')[0]} using version: {package.__version__}")
Python version: 3.11.9 (main, Apr  2 2024, 08:25:04) [Clang 15.0.0 (clang-1500.3.9.4)]
<module 'pandas'  using version: 2.1.4
<module 'numpy'  using version: 1.26.4
<module 'seaborn'  using version: 0.13.2
<module 'statsmodels.api'  using version: 0.14.1
<module 'xgboost'  using version: 2.0.3
<module 'torch'  using version: 2.2.2
Back to Table of Contents¶

2.2. Importing the Data: ¶

In [280]:
# Set directories
current_wdir = os.getcwd()
data_folder_hourly = current_wdir + '/Data/Hourly_Load'
data_folder_weather = current_wdir + '/Data/Weather'
In [281]:
# Add and sort all filenames from each folder path.
file_path_load = [f'{data_folder_hourly}/{file}' for file in os.listdir(data_folder_hourly) if '.csv' in file]
file_path_load = sorted(file_path_load)

file_path_weather = [f'{data_folder_weather}/{file}' for file in os.listdir(data_folder_weather) if '.csv' in file]
file_path_weather = sorted(file_path_weather)

# Iterate through filenames and add them to dataframe.
# pd.read_csv can unzip as it goes with compression argument.
load_df = pd.concat([pd.read_csv(file, compression = 'gzip') for file in file_path_load], join = 'outer', ignore_index = False, axis = 0)
weather_df_list = []
for filename in file_path_weather:
    # delimiter has whitespace after comma, specified here as \s+.
    temp_df = pd.read_csv(filename, compression = 'gzip', delimiter = ',\s+', engine = 'python')
    temp_df['zone']= os.path.basename(filename).replace('.csv.gz', '')
    weather_df_list.append(temp_df)
weather_df = pd.concat(weather_df_list, join = 'outer', ignore_index = False, axis = 0)
Back to Table of Contents¶

3. Data Preprocessing ¶


3.1. First Looks: ¶

In [282]:
display(load_df)
display(load_df.dtypes)
datetime_beginning_utc datetime_beginning_ept nerc_region mkt_region zone load_area mw is_verified
0 1/1/1993 5:00:00 AM 1/1/1993 12:00:00 AM PJM RTO PJM BC BC 2358.000 True
1 1/1/1993 5:00:00 AM 1/1/1993 12:00:00 AM PJM RTO PJM CNCT AE 855.000 True
2 1/1/1993 5:00:00 AM 1/1/1993 12:00:00 AM PJM RTO PJM CNCT DPL 1150.000 True
3 1/1/1993 5:00:00 AM 1/1/1993 12:00:00 AM PJM RTO PJM GPU JC 1632.000 True
4 1/1/1993 5:00:00 AM 1/1/1993 12:00:00 AM PJM RTO PJM GPU ME 929.000 True
... ... ... ... ... ... ... ... ...
62605 3/28/2024 3:00:00 AM 3/27/2024 11:00:00 PM RFC MIDATL RECO RECO 131.877 False
62606 3/28/2024 3:00:00 AM 3/27/2024 11:00:00 PM RFC MIDATL PEP SMECO 355.931 False
62607 3/28/2024 3:00:00 AM 3/27/2024 11:00:00 PM RFC MIDATL PL UGI 101.478 True
62608 3/28/2024 3:00:00 AM 3/27/2024 11:00:00 PM RFC MIDATL AE VMEU 70.051 False
62609 3/28/2024 3:00:00 AM 3/27/2024 11:00:00 PM RTO RTO RTO RTO 80180.923 False

5446243 rows × 8 columns

datetime_beginning_utc     object
datetime_beginning_ept     object
nerc_region                object
mkt_region                 object
zone                       object
load_area                  object
mw                        float64
is_verified                  bool
dtype: object
  • There's a few columns encoding geographic region data with the regions indicating larger areas and zone/load_area denoting more specific areas. Zone will be useful to assign the weather data to.
  • One datetime can be dropped. The vast majority of these regions are in the US Eastern timezone so 'datetime_beginning_ept' will be useful to keep for interpretability.
  • is_verified denotes weather or not an electric company employee visually confirmed the load number on their meter. This will be ignored for this project.

Some datatype reformatting will have to be done on the date.

In [283]:
display(weather_df)
display(weather_df.dtypes)
Date MaxTemperature MinTemperature AvgTemperature Precipitation zone
0 2015-01-01 43 17 30.0 0.00 AE
1 2015-01-02 45 24 34.5 0.00 AE
2 2015-01-03 47 21 34.0 0.62 AE
3 2015-01-04 64 46 55.0 0.26 AE
4 2015-01-05 50 25 37.5 0.00 AE
... ... ... ... ... ... ...
11417 2024-04-05 57 40 48.5 0.00 PEP
11418 2024-04-06 57 41 49.0 0.00 PEP
11419 2024-04-07 65 43 54.0 0.00 PEP
11420 2024-04-08 71 44 57.5 0.00 PEP
11421 2024-04-09 78 51 64.5 0.00 PEP

32400 rows × 6 columns

Date              object
MaxTemperature    object
MinTemperature    object
AvgTemperature    object
Precipitation     object
zone              object
dtype: object

All of these will potentially be useful as covariates in the models.

The sample rates between datasets are mismatched - electric load data is hourly and this weather data is daily. This will require either load resampling and aggregation or assigning this weather data to each hour as a static daily feature.

Back to Table of Contents¶

3.2. Reformatting: ¶

Convert the dates into datatime objects.

In [284]:
# Convert to datetime. The two dataframes share column names.
date_cols = ['datetime_beginning_utc', 'datetime_beginning_ept']
load_df[date_cols] = load_df[date_cols].apply(pd.to_datetime, format = '%m/%d/%Y %I:%M:%S %p', utc = False)
weather_df['Date'] = weather_df['Date'].apply(pd.to_datetime, format = '%Y-%m-%d', utc = False)
Back to Table of Contents¶

3.3. Discovery: ¶

Since the zone column will likely be used to group the data by, we can start by grouping the data by zone and date and plot over time.

In [285]:
load_zone_df = load_df[['datetime_beginning_ept', 'zone', 'mw']].groupby(['datetime_beginning_ept', 'zone'], as_index = False).sum()
In [286]:
ax, fig = plt.subplots(figsize = (14, 8))
ax = sns.lineplot(load_zone_df, x = 'datetime_beginning_ept', y = 'mw',  hue = 'zone', alpha = 0.7, lw = 1)

Zone 'RTO' seems to be on a different scale and looks like an aggregate sum of all the zones. This is likely to be the case because 'RTO' stands for Regional Transmission Organization which is exactly where this data came from.

Let's confirm this by dropping the 'RTO' zone, summing the remaining zones and overlaying it on the previous plot.

In [287]:
load_noRTO_df = load_zone_df.drop(load_zone_df[load_zone_df['zone'] == 'RTO'].index)
In [288]:
ax, fig = plt.subplots(figsize = (14, 8))
# Black - Sum of ALL
ax = sns.lineplot(load_zone_df[['datetime_beginning_ept', 'mw']].groupby(['datetime_beginning_ept'], as_index = False).sum(), x = 'datetime_beginning_ept', y = 'mw', alpha = 0.7, lw = 1, c = 'black')
# Purple - Sum of All but NO RTO
ax = sns.lineplot(load_noRTO_df[['datetime_beginning_ept', 'mw']].groupby(['datetime_beginning_ept'], as_index = False).sum(), x = 'datetime_beginning_ept', y = 'mw', alpha = 0.7, lw = 1, c = 'purple')
ax = sns.lineplot(load_zone_df, x = 'datetime_beginning_ept', y = 'mw',  hue = 'zone', alpha = 0.7, lw = 1)

Besides a few artifacts in the data, it looks like that was the case. We can get rid of RTO since forecasting an aggregated time series signal isn't the goal here.

Next, more information is needed about how many zones there are and where they are located.

In [289]:
sorted(load_df['zone'].unique())
Out[289]:
['AE',
 'AEP',
 'AP',
 'ATSI',
 'BC',
 'CE',
 'CNCT',
 'DAY',
 'DEOK',
 'DOM',
 'DPL',
 'DUQ',
 'EKPC',
 'GPU',
 'JC',
 'ME',
 'OVEC',
 'PE',
 'PEP',
 'PL',
 'PN',
 'PS',
 'RECO',
 'RTO']

Here we can manually look at these zones and research where they are located. This information is important to align the weather data. Most of them easily identified with a map from the data source (can be found in section 3.4).

Zones:

  • AE - New Jersey
  • AEP - Ohio
  • AP - West Virginia
  • ATSI - Ohio
  • BC - Maryland
  • CE - Northern Illinois
  • CNCT - Connecticut
  • DAY - Ohio
  • DEOK - Ohio
  • DOM - Virginia
  • DPL - Connecticut
  • DUQ - Pennsylvania
  • EKPC - Kentucky
  • GPU - New Jersey
  • JC - New Jersey
  • ME - Pennsylvania
  • OVEC - Ohio
  • PE - Pennsylvania
  • PEP - Maryland
  • PL - Pennsylvania
  • PN - Pennsylvania
  • PS - New Jersey
  • RECO - New Jersey

  • RTO - Regional Transmission Organization (PJM) - Aggregated Sum

Forecasting ALL of these zones is beyond the scope and computational limitations of this project, because of this let's build out a function to visualize each zone and choose a few that are interesting that will be used to build the models and perform predictions on.

In [290]:
def load_zone_plot(zone_df, outliers=None, figsize=(16,16), cmap=None):
    """Finds all unique zone names and prints to output (n x 2) subplot array of line plots where x = datetime vs y = MW. 
    Can handle any length of list of strings.

    Parameters:
        zones_df (list): Pandas dataframe grouped by zone and time such as load_zone_df.
    Optional Parameters:
        outliers (dict): List of outlier indexes that will be plotted as points on top of line plot. Has the form of {'Zone' : [idx, idx, ...]}
        figsize (tuple): Default = (16,16)
    Returns:
        None. Prints plot to output."""
    
    zone_list = zone_df['zone'].unique()

    color_pal = ['#005d5d', '#017d66']
    
    fig, ax = plt.subplots(int(np.ceil(len(zone_list)/2)), 2, figsize = figsize, sharey = False, sharex = False)
    for i, zone in enumerate(zone_list):
        #print(f'{i} {zone} -- [{i//2}, {i%2}]')
        # i iterates through axes rows and columns with // and % operators 
        # ([0,0] to [0,1] to [1,0] ... etc).
        sns.lineplot(load_zone_df.loc[load_zone_df['zone'] == zone], ax = ax[i//2, i%2], 
                     x = 'datetime_beginning_ept', y = 'mw', 
                     alpha = 0.7, lw = 1,
                     # Alternates colors between rows.
                     c = cmap[zone] if cmap else (color_pal[int(i % 4 < 2)]))
        ax[i//2, i%2].set_title(zone, y = 1)
        ax[i//2, i%2].set_xlabel(None)
        ax[i//2, i%2].set_ylabel(None)

        # If given outlier indexes, plot them.
        if outliers:
            sns.scatterplot(load_zone_df.iloc[outliers.get(zone)], ax = ax[i//2, i%2],
                            x = 'datetime_beginning_ept', y = 'mw',
                            c = 'black')
        
    plt.subplots_adjust(wspace = 0.3, hspace = 1)
    fig.suptitle('Hourly Load of Each Zone')
    fig.supxlabel('Datetime')
    fig.supylabel('Megawatts (MW)')
    fig.tight_layout()

    # Check if odd number of plots and delete last subplot if true.
    if len(zone_list) % 2 != 0:
        fig.delaxes(ax[i // 2, (len(zone_list) % 2)])
    
    return None
In [291]:
load_zone_plot(load_zone_df)
Back to Table of Contents¶

3.4. Selecting Electrical Load Data: ¶

The following zones will be chosen for the remainder of this project.

  • AE: Atlantic City Electric Co.
  • CE: ComEd
  • DOM: Dominion
  • JC: Jersey Central Power & Light
  • PEP: Potomac Electric Power Co.
In [292]:
zones = ['AE', 'CE', 'DOM', 'JC', 'PEP']
load_zone_df = load_zone_df.loc[load_zone_df['zone'].isin(zones)]
load_zone_df.reset_index(drop = True, inplace = True)

# Color map zones for consistency.
palette = ['#570408', '#005d5d', '#1192e8', '#fa4d56', '#012749']
zones_palette = dict(zip(zones, palette))

Note about the weather data:

After choosing zones to work with, weather data was sourced from National Oceanic and Atmospheric Administration's (NOAA) Regional Climate Center (RCC) Applied Climate Information System (ACIS). Detailed information can be found in section 1. For continuity the weather data was imported above along with the other datasets after the zones were chosen.

Ideally, the sampling rate would match the hourly load data, however, only daily weather data is available at this time.

A weather station was chosen for each zone. The criteria considered for the weather station was:

  • Geographically central to the listed zone.
  • Weather station uptime reliability (airport weather stations often chosen for this quality.)

Weather Stations Chosen:

  • AE: MILLVILLE MUNICIPAL AIRPORT, NJ
  • CE: CHICAGO OHARE INTL AP, IL
  • DOM: RICHMOND INTERNATIONAL AP, VA
  • JC: NEW BRUNSWICK 3 SE, NJ
  • PEP: WASHINGTON REAGAN NATIONAL AIRPORT, VA

Zones

(Map Source: https://www.pjm.com/-/media/about-pjm/pjm-zones.ashx)

Back to Table of Contents¶

3.5. Missing Data: ¶

3.5.1. Missing Dates - Where Has the Time Gone?: ¶

Some but not all of the models require the sample frequency to have no missing samples (in this case hours in load_df and days in weather_df - they will be joined later where the weather data will become hourly static features.).

Method:

  1. Store any missing datetimes.
  2. For any found missing datetimes, the average of the previous and next observed data is used to fill in the missing data.
In [293]:
# Could also use .asfreq('H') which would assign any missing rows to NaN then identify rows.
print('######### Check for Missing Weather Dates #########')
for zone in zones:
    # Find min and max dates for that zone.
    date_min = weather_df.loc[weather_df['zone'] == zone]['Date'].min()
    date_max = weather_df.loc[weather_df['zone'] == zone]['Date'].max()
    print(f'{zone} - {date_min} | {date_max}')
    # Find missing dates.
    # Weather DF - Set freq to 'D' for days.
    miss_date = pd.date_range(start = date_min, end = date_max, freq = 'D').difference(weather_df['Date'])
    print(f"{'':<5}Missing Dates: {miss_date if len(miss_date) > 0 else False}")

print('\n######### Check for Missing Energy Dates #########')
for zone in zones:
    # Find min and max dates for that zone.
    date_min = load_zone_df.loc[load_zone_df['zone'] == zone]['datetime_beginning_ept'].min()
    date_max = load_zone_df.loc[load_zone_df['zone'] == zone]['datetime_beginning_ept'].max()
    print(f'{zone} - {date_min} | {date_max}')
    # Find missing dates. 
    # Load DF - Set freq to 'H' for hours.
    miss_date = pd.date_range(start = date_min, end = date_max, freq = 'H').difference(
        load_zone_df['datetime_beginning_ept'].loc[load_zone_df['zone'] == zone])
    print(f"{'':<5}Missing Dates: {miss_date if len(miss_date) > 0 else False}")

    # Check to see if missing dates exist.
    # Fill missing dates with average of two surrounding times.
    # TODO: There's definitely some optimizations to be had here.
    if len(miss_date) > 0:
        print(f"{'':<5}============================================================")
        print(f"{'':<5}Adding Average of Surrounding Times to {zone} Missing Dates")
        # Find index of one hour before and after missing
        miss_pre_index = (load_zone_df.loc[(load_zone_df['zone'] == zone) & 
                                           (load_zone_df['datetime_beginning_ept'].isin(
                                               miss_date - pd.Timedelta(hours = 1)))]).index
        miss_post_index = (load_zone_df.loc[(load_zone_df['zone'] == zone) & 
                                            (load_zone_df['datetime_beginning_ept'].isin(
                                                miss_date + pd.Timedelta(hours = 1)))]).index
        # Calculate average of two surround times.
        miss_avg = [np.round(np.mean((a,b)), 3) for a, b in 
                    (zip(load_zone_df.iloc[miss_pre_index]['mw'], load_zone_df.iloc[miss_post_index]['mw']))]
        # Create DF to hold values and concat to main DF.
        temp_df = pd.DataFrame({'datetime_beginning_ept' : miss_date,
                    'zone' : zone,
                    'mw' : miss_avg})
        load_zone_df = pd.concat([load_zone_df, temp_df], join = 'outer', ignore_index = False, axis = 0)
        print(f"{'':<5}{'='*5} Done!\n")
######### Check for Missing Weather Dates #########
AE - 2015-01-01 00:00:00 | 2024-04-09 00:00:00
     Missing Dates: False
CE - 2004-05-01 00:00:00 | 2024-04-09 00:00:00
     Missing Dates: False
DOM - 2005-05-01 00:00:00 | 2024-04-09 00:00:00
     Missing Dates: False
JC - 2015-01-01 00:00:00 | 2024-04-10 00:00:00
     Missing Dates: False
PEP - 1993-01-01 00:00:00 | 2024-04-09 00:00:00
     Missing Dates: False

######### Check for Missing Energy Dates #########
AE - 2015-01-01 00:00:00 | 2024-03-27 23:00:00
     Missing Dates: DatetimeIndex(['2015-03-08 02:00:00', '2016-03-13 02:00:00',
               '2017-03-12 02:00:00', '2018-03-11 02:00:00',
               '2019-03-10 02:00:00', '2020-03-08 02:00:00',
               '2021-03-14 02:00:00', '2022-03-13 02:00:00',
               '2023-03-12 02:00:00', '2024-03-10 02:00:00'],
              dtype='datetime64[ns]', freq=None)
     ============================================================
     Adding Average of Surrounding Times to AE Missing Dates
     ===== Done!

CE - 2004-05-01 00:00:00 | 2024-03-27 23:00:00
     Missing Dates: DatetimeIndex(['2005-04-03 02:00:00', '2006-04-02 02:00:00',
               '2007-03-11 02:00:00', '2008-03-09 02:00:00',
               '2009-03-08 02:00:00', '2010-03-14 02:00:00',
               '2011-03-13 02:00:00', '2012-03-11 02:00:00',
               '2013-03-10 02:00:00', '2014-03-09 02:00:00',
               '2015-03-08 02:00:00', '2016-03-13 02:00:00',
               '2017-03-12 02:00:00', '2018-03-11 02:00:00',
               '2019-03-10 02:00:00', '2020-03-08 02:00:00',
               '2021-03-14 02:00:00', '2022-03-13 02:00:00',
               '2023-03-12 02:00:00', '2024-03-10 02:00:00'],
              dtype='datetime64[ns]', freq=None)
     ============================================================
     Adding Average of Surrounding Times to CE Missing Dates
     ===== Done!

DOM - 2005-05-01 00:00:00 | 2024-03-27 23:00:00
     Missing Dates: DatetimeIndex(['2006-04-02 02:00:00', '2007-03-11 02:00:00',
               '2008-03-09 02:00:00', '2009-03-08 02:00:00',
               '2010-03-14 02:00:00', '2011-03-13 02:00:00',
               '2012-03-11 02:00:00', '2013-03-10 02:00:00',
               '2014-03-09 02:00:00', '2015-03-08 02:00:00',
               '2016-03-13 02:00:00', '2017-03-12 02:00:00',
               '2018-03-11 02:00:00', '2019-03-10 02:00:00',
               '2020-03-08 02:00:00', '2021-03-14 02:00:00',
               '2022-03-13 02:00:00', '2023-03-12 02:00:00',
               '2024-03-10 02:00:00'],
              dtype='datetime64[ns]', freq=None)
     ============================================================
     Adding Average of Surrounding Times to DOM Missing Dates
     ===== Done!

JC - 2015-01-01 00:00:00 | 2024-03-27 23:00:00
     Missing Dates: DatetimeIndex(['2015-03-08 02:00:00', '2016-03-13 02:00:00',
               '2017-03-12 02:00:00', '2018-03-11 02:00:00',
               '2019-03-10 02:00:00', '2020-03-08 02:00:00',
               '2021-03-14 02:00:00', '2022-03-13 02:00:00',
               '2023-03-12 02:00:00', '2024-03-10 02:00:00'],
              dtype='datetime64[ns]', freq=None)
     ============================================================
     Adding Average of Surrounding Times to JC Missing Dates
     ===== Done!

PEP - 1993-01-01 00:00:00 | 2024-03-27 23:00:00
     Missing Dates: DatetimeIndex(['1993-04-04 02:00:00', '1994-04-03 02:00:00',
               '1995-04-02 02:00:00', '1996-04-07 02:00:00',
               '1997-04-06 02:00:00', '1998-04-05 02:00:00',
               '1999-04-04 02:00:00', '2000-04-02 02:00:00',
               '2001-04-01 02:00:00', '2002-04-07 02:00:00',
               '2003-04-06 02:00:00', '2004-04-04 02:00:00',
               '2005-04-03 02:00:00', '2006-04-02 02:00:00',
               '2007-03-11 02:00:00', '2008-03-09 02:00:00',
               '2009-03-08 02:00:00', '2010-03-14 02:00:00',
               '2011-03-13 02:00:00', '2012-03-11 02:00:00',
               '2013-03-10 02:00:00', '2014-03-09 02:00:00',
               '2015-03-08 02:00:00', '2016-03-13 02:00:00',
               '2017-03-12 02:00:00', '2018-03-11 02:00:00',
               '2019-03-10 02:00:00', '2020-03-08 02:00:00',
               '2021-03-14 02:00:00', '2022-03-13 02:00:00',
               '2023-03-12 02:00:00', '2024-03-10 02:00:00'],
              dtype='datetime64[ns]', freq=None)
     ============================================================
     Adding Average of Surrounding Times to PEP Missing Dates
     ===== Done!

weather_df had no missing dates, so a solution to add date rows was not implemented.

Now to recheck load_df to make sure our data injection worked.

In [294]:
for zone in zones:
    # Find min and max dates for that zone.
    date_min = load_zone_df.loc[load_zone_df['zone'] == zone]['datetime_beginning_ept'].min()
    date_max = load_zone_df.loc[load_zone_df['zone'] == zone]['datetime_beginning_ept'].max()
    print(f'{zone} - {date_min} | {date_max}')
    # Find missing dates. 
    # Load DF - Set freq to 'H' for hours.
    miss_date = pd.date_range(start = date_min, end = date_max, freq = '1H').difference(
        load_zone_df['datetime_beginning_ept'].loc[load_zone_df['zone'] == zone])
    print(f"{'':<5}Missing Dates: {miss_date if len(miss_date) > 0 else False}")

# Sort by date and reset index.
load_zone_df = load_zone_df.sort_values(by = 'datetime_beginning_ept').reset_index(drop = True)
AE - 2015-01-01 00:00:00 | 2024-03-27 23:00:00
     Missing Dates: False
CE - 2004-05-01 00:00:00 | 2024-03-27 23:00:00
     Missing Dates: False
DOM - 2005-05-01 00:00:00 | 2024-03-27 23:00:00
     Missing Dates: False
JC - 2015-01-01 00:00:00 | 2024-03-27 23:00:00
     Missing Dates: False
PEP - 1993-01-01 00:00:00 | 2024-03-27 23:00:00
     Missing Dates: False
Back to Table of Contents¶

3.5.2. Missing Weather Data: ¶

Though the weather dataset was not missing any days, according to NOAA, the data logged may have flags for certain circumstances or measurement errors.

According to their glossary, weather station data may contain the following flags:

  • M = Missing (No measurement.)
  • T = Trace Amount (Precipitation)

These values need to be located and handled in different ways.

  • M: A window of 6 days (3 before and 3 after) will be created and the average measurement taken.
  • T: T will be replaced with a comparable measurement on this scale, 0.01.
In [295]:
# Find unique non-numeric labels and assign values depending on col
for col in weather_df.columns[1:5]:
    search_str = set(weather_df[col].unique().astype(str))
    unique_strings = {x for x in search_str if x.isalpha()}
    print(f"{col} | Unique Strings: {unique_strings if len(unique_strings) > 0 else False}")

    if col == 'Precipitation':
        weather_df[col] = weather_df[col].mask(weather_df[col] == 'T', '0.01') # Trace
        weather_df[col] = weather_df[col].mask(weather_df[col] == 'M', '0.00') # Missing
        continue # Move on to next iteration.
    
    # Can handle Min, Max, Avg columns the same.
    # Take surrounding 6 values (by date) and average them to get missing value.
    m_index = weather_df.loc[weather_df[col] == 'M']

    # TODO: Perform and exorcism on this section....
    # Not ideal, could arise that Max < Min, or Avg isn't calc from Min and Max, etc.
    # AE is the only zone with M values, hard coded for now.
    m_df = pd.DataFrame({
        '3' : weather_df[col].loc[(weather_df['Date'].isin(m_index['Date'] + pd.Timedelta(days=3))) & (weather_df['zone'] == 'AE')].to_list(),
        '2' : weather_df[col].loc[(weather_df['Date'].isin(m_index['Date'] + pd.Timedelta(days=2))) & (weather_df['zone'] == 'AE')].to_list(),
        '1' : weather_df[col].loc[(weather_df['Date'].isin(m_index['Date'] + pd.Timedelta(days=1))) & (weather_df['zone'] == 'AE')].to_list(),
        '0' : weather_df[col].loc[(weather_df['Date'].isin(m_index['Date'] + pd.Timedelta(days=0))) & (weather_df['zone'] == 'AE')].to_list(),
        '-1' : weather_df[col].loc[(weather_df['Date'].isin(m_index['Date'] - pd.Timedelta(days=1))) & (weather_df['zone'] == 'AE')].to_list(),
        '-2' : weather_df[col].loc[(weather_df['Date'].isin(m_index['Date'] - pd.Timedelta(days=2))) & (weather_df['zone'] == 'AE')].to_list(),
        '-3' : weather_df[col].loc[(weather_df['Date'].isin(m_index['Date'] - pd.Timedelta(days=3))) & (weather_df['zone'] == 'AE')].to_list(),
        },index = m_index.index)

    m_df = m_df.apply(pd.to_numeric, errors = 'coerce')
    m_df['new_values'] = (np.round(m_df.mean(axis = 1)))
    # Assign new average to missing cells.
    weather_df.iloc[m_df.index, weather_df.columns.get_loc(col)] = m_df['new_values']

# Return original dtypes to columns and reset index.
weather_df['MaxTemperature'] = weather_df['MaxTemperature'].astype('int')
weather_df['MinTemperature'] = weather_df['MinTemperature'].astype('int')
weather_df.reset_index(drop = True, inplace = True)
MaxTemperature | Unique Strings: {'M'}
MinTemperature | Unique Strings: {'M'}
AvgTemperature | Unique Strings: {'M'}
Precipitation | Unique Strings: {'M', 'T'}

To quickly check if any remain, the numeric columns can be cast as numeric type and any remaining string characters will automatically be change to NaN. Then each will be checked for any NaN values.

In [296]:
# Change numeric columns to numeric and coerce all remaining errors/non-numbers to NaN.
weather_df[['MaxTemperature', 'MinTemperature', 'AvgTemperature', 'Precipitation']] = weather_df[
    ['MaxTemperature', 'MinTemperature', 'AvgTemperature', 'Precipitation']].apply(pd.to_numeric, errors = 'coerce')
weather_df.isna().sum()
Out[296]:
Date              0
MaxTemperature    0
MinTemperature    0
AvgTemperature    0
Precipitation     0
zone              0
dtype: int64

Looks like it has been cleaned up.

Now that they are in their proper numeric format, .describe() can be used to get a better picture of their distributions.

In [297]:
display(weather_df.describe())
weather_df.dtypes
Date MaxTemperature MinTemperature AvgTemperature Precipitation
count 32400 32400.000000 32400.000000 32400.000000 32400.000000
mean 2013-07-11 13:54:42.666666496 66.006204 47.391914 56.698920 0.121390
min 1993-01-01 00:00:00 -10.000000 -23.000000 -16.500000 0.000000
25% 2008-04-12 00:00:00 51.000000 34.000000 42.500000 0.000000
50% 2015-05-28 00:00:00 68.000000 48.000000 57.500000 0.000000
75% 2019-11-03 00:00:00 82.000000 63.000000 72.500000 0.050000
max 2024-04-10 00:00:00 105.000000 84.000000 93.500000 7.610000
std NaN 18.895985 17.544789 17.896425 0.340688
Out[297]:
Date              datetime64[ns]
MaxTemperature             int64
MinTemperature             int64
AvgTemperature           float64
Precipitation            float64
zone                      object
dtype: object

Nothing looks out of the ordinary now.

Back to Table of Contents¶

3.6. Outlier Investigation: ¶

From the previous plots a few outliers could be seen. No matter the source of these outliers, be it systematic like scheduled maintenance, system downtime after a storm, or measurement errors, it will be important to remove them before training the models.

First, let's take a closer look at the chosen zones.

In [298]:
load_zone_plot(load_zone_df, figsize = (14,14), cmap = zones_palette)

PEP, DOM, and AE each have obvious outliers. A method to detect them needs to be chosen. It seems a quantile threshold approach could be effective in this case.

Here a test an be done to get an idea how far the threshold needs to be set by listing small slices of quantiles.

In [299]:
load_zone_df.loc[load_zone_df['zone'] == 'PEP']['mw'].quantile([0.00005, 0.01, 0.99, 0.9995])
Out[299]:
0.00005    1755.867458
0.01000    2029.000000
0.99000    5626.516500
0.99950    6445.806363
Name: mw, dtype: float64

This looks right because from the PEP plot it can be seen most of the obvious outliers are somewhere below 1800 mw.

Here a detection method will be implemented.

  1. Create a tumbling window that chronologically captures 35 days at a time.
  2. Calculate the median mw value within that 35 day window.
  3. With the median, calculate an upper and lower bound threshold by taking:
    • $Median \pm (QuantileScalar * S)$
      • where, $S =$ Window Standard Deviation
  4. Log the index of any data point mw outside the threshold.
  5. The median of that window will be logged so the outlier can be replaced with it.
  6. The tumbling window will continue until the end of the dataset.

Along with this method a z-score method will be implemented as well and whichever method seems to capture more of the true outliers, will be used.

Z-Score Method:

  • Calculate the z-score for each data point and calibrate/set a threshold number where if the z-score is outside of it, that point is an outlier.
    • $Z = \frac{x-\mu}{\sigma}$
In [300]:
outliers_index = {}
outliers_z = {}
replacement_medians = {}
for zone in zones:
    outliers_index[zone] = []
    outliers_z[zone] = []
    zone_df = load_zone_df.loc[load_zone_df['zone'] == zone]
    # Find min and max dates for that zone.
    date_min = zone_df['datetime_beginning_ept'].min()
    date_max = zone_df['datetime_beginning_ept'].max()
    # Create tumbling window to locally detect outliers.
    window_size = 35 # Days
    left = date_min
    right = left + pd.Timedelta(days = window_size)
    
    # Tumbling window calculates outlier thresholds 
    # Two methods: Quantile_Thresh = (median +/- (scalar*SD)) or Zscore = z >/< thresh.
    # Wasn't aware of the df.rolling() window method until after, could use instead.
    for i in range(int(((date_max-date_min) / pd.Timedelta(days = window_size)) + 1)):
        if right > date_max:
            right = date_max

        window_df = zone_df.loc[(zone_df['datetime_beginning_ept'].between(left, right, inclusive = 'both'))]

        w_std = window_df['mw'].std()
        w_median = window_df['mw'].median()
        q_lower, q_upper = w_median - (5.75*w_std), w_median + (5.75*w_std) # quantile
        w_mean = window_df['mw'].mean()
        w_zscore = (window_df['mw'] - w_mean) / w_std # zscore

        # Quantile and zscore methods.
        outliers_temp = list(window_df.loc[(window_df['mw'] < q_lower) | (window_df['mw'] > q_upper)].index)
        outliers_z_temp = list(w_zscore.loc[(w_zscore < -5.5) | (w_zscore > 5.5)].index)
        # Store outliers if detected.
        if len(outliers_temp) > 0:
            outliers_index[zone].extend(outliers_temp)
            for i in outliers_temp:
                replacement_medians[i] = round(w_median, 3)
        if len(outliers_z_temp) > 0:
            outliers_z[zone].extend(outliers_z_temp)
        # Iterate window pointers
        left = right + pd.Timedelta(hours = 1)
        right = right + pd.Timedelta(days = window_size)
In [301]:
outliers_z
Out[301]:
{'AE': [598305, 626861, 671381, 715061, 730185],
 'CE': [366819, 583181, 626862, 671382, 758744],
 'DOM': [129939, 261988, 288196, 314405, 626863, 671383, 715063, 758745],
 'JC': [626864, 671384],
 'PEP': [598301, 642824, 671385, 730181, 758741]}
In [302]:
outliers_index
Out[302]:
{'AE': [598305, 626861, 671381, 715061, 730185],
 'CE': [583181, 626862, 758744],
 'DOM': [261988, 626863, 671383, 715063, 758745],
 'JC': [],
 'PEP': [598301, 642824, 730181]}

The load_zone_plot function was altered to take the index of any outliers and plot them with a black point.

In [303]:
load_zone_plot(load_zone_df, outliers_index, figsize = (14,14), cmap = zones_palette)

The quantile method was chosen as it seemed to calibrate to a better result with this data.

Now to replace the found outliers with the median of the window that was stored earlier.

In [304]:
for zone in zones:
    display(load_zone_df.iloc[outliers_index.get(zone)])

load_zone_df.loc[replacement_medians.keys(), 'mw'] = list(replacement_medians.values())
    
# for zone, index in outliers_index.items():
#     load_zone_df.drop(outliers_index.get(zone), inplace = True)
datetime_beginning_ept zone mw
598305 2020-03-08 01:00:00 AE 63.719
626861 2020-11-01 01:00:00 AE 1731.123
671381 2021-11-07 01:00:00 AE 2180.259
715061 2022-11-06 01:00:00 AE 1665.363
730185 2023-03-12 01:00:00 AE 68.317
datetime_beginning_ept zone mw
583181 2019-11-03 01:00:00 CE 17586.314
626862 2020-11-01 01:00:00 CE 15989.914
758744 2023-11-05 01:00:00 CE 16491.537
datetime_beginning_ept zone mw
261988 2010-11-07 01:00:00 DOM 17494.000
626863 2020-11-01 01:00:00 DOM 18249.653
671383 2021-11-07 01:00:00 DOM 20633.260
715063 2022-11-06 01:00:00 DOM 19902.046
758745 2023-11-05 01:00:00 DOM 21330.288
datetime_beginning_ept zone mw
datetime_beginning_ept zone mw
598301 2020-03-08 01:00:00 PEP 431.527
642824 2021-03-14 01:00:00 PEP 354.625
730181 2023-03-12 01:00:00 PEP 427.420
In [305]:
# To double-check that they were properly replaced.
# load_zone_df.loc[replacement_medians.keys()]

After reviewing the plot after outlier replacement, a few remaining data points were still there. Likely falling within the threshold because the previously removed outliers were skewing the distribution. The previous code could be altered to rerun and capture a second or third-wave but the remaining outliers could also just be captured by visuals and sorting the data.

The remainder will be manually removed for now and replaced with the median using the same process as before.

In [306]:
display(load_zone_df.loc[load_zone_df['zone'] == 'AE'].sort_values('mw', ascending = True).head(3))
display(load_zone_df.loc[load_zone_df['zone'] == 'PEP'].sort_values('mw', ascending = True).head(5))
display(load_zone_df.loc[load_zone_df['zone'] == 'DOM'].sort_values('mw', ascending = True).head(3))

outliers_manual = [642823, 642830, 53319, 93914, 93915, 93916, 367978,  598310, 730188, 93913, 93917, 93912]
manual_medians = []

for outlier in outliers_manual:
    zone = load_zone_df.loc[outlier]['zone']
    zone_df = load_zone_df.loc[load_zone_df['zone'] == zone]
    # Find min and max dates for that zone.
    date_min = zone_df['datetime_beginning_ept'].min()
    date_max = zone_df['datetime_beginning_ept'].max()
    # Create indow to calculate median.
    left = zone_df.loc[outlier]['datetime_beginning_ept'] - pd.Timedelta(days = 17)
    right = zone_df.loc[outlier]['datetime_beginning_ept'] + pd.Timedelta(days = 17)
    window_df = zone_df.loc[(zone_df['datetime_beginning_ept'].between(left, right, inclusive = 'both'))]
    w_median = window_df['mw'].median()

    manual_medians.append(round(w_median, 3))

load_zone_df.loc[outliers_manual, 'mw'] = manual_medians
datetime_beginning_ept zone mw
642823 2021-03-14 01:00:00 AE 62.426
605926 2020-05-10 14:00:00 AE 475.194
605923 2020-05-10 13:00:00 AE 475.531
datetime_beginning_ept zone mw
642830 2021-03-14 02:00:00 PEP 1406.669
53319 1999-01-31 15:00:00 PEP 1505.000
93914 2003-09-19 02:00:00 PEP 1578.554
93915 2003-09-19 03:00:00 PEP 1591.057
93916 2003-09-19 04:00:00 PEP 1609.728
datetime_beginning_ept zone mw
367978 2014-11-18 03:00:00 DOM 4724.204
283165 2011-08-28 04:00:00 DOM 5516.000
283162 2011-08-28 03:00:00 DOM 5519.000
In [307]:
# Re-check
# load_zone_df.loc[outliers_manual]

Now to plot the results of the efforts.

In [308]:
load_zone_plot(load_zone_df, figsize = (14,14), cmap = zones_palette)

Looks much better!

Back to Table of Contents¶

3.7. Merging the Datasets: ¶

It is now time to join the datasets into one and duplicate the weather data to fit into the load dataset's hourly frequency.

In [309]:
load_zone_df.rename(columns = {'datetime_beginning_ept' : 'Date'}, inplace = True)
load_zone_df.reset_index(drop = True, inplace = True)
In [310]:
# Merge
load_df = load_zone_df.merge(weather_df, on = ['Date', 'zone'], how = 'left')

# Sort values by zone and date so we can use .ffill()
load_df.sort_values(by = ['zone', 'Date'], inplace = True)
display(load_df)

# Fill forward weather data up until next day.
load_df.ffill(inplace = True)
# Confirm fill worked.
display(load_df.tail(26))
Date zone mw MaxTemperature MinTemperature AvgTemperature Precipitation
371136 2015-01-01 00:00:00 AE 1180.082 43.0 17.0 30.0 0.0
371143 2015-01-01 01:00:00 AE 1129.988 NaN NaN NaN NaN
371149 2015-01-01 02:00:00 AE 1088.828 NaN NaN NaN NaN
371151 2015-01-01 03:00:00 AE 1064.563 NaN NaN NaN NaN
371157 2015-01-01 04:00:00 AE 1060.820 NaN NaN NaN NaN
... ... ... ... ... ... ... ...
775991 2024-03-27 19:00:00 PEP 3202.078 NaN NaN NaN NaN
776000 2024-03-27 20:00:00 PEP 3156.612 NaN NaN NaN NaN
776005 2024-03-27 21:00:00 PEP 3035.994 NaN NaN NaN NaN
776009 2024-03-27 22:00:00 PEP 2861.596 NaN NaN NaN NaN
776015 2024-03-27 23:00:00 PEP 2698.059 NaN NaN NaN NaN

776016 rows × 7 columns

Date zone mw MaxTemperature MinTemperature AvgTemperature Precipitation
775886 2024-03-26 22:00:00 PEP 2751.947 61.0 39.0 50.0 0.01
775895 2024-03-26 23:00:00 PEP 2608.338 61.0 39.0 50.0 0.01
775897 2024-03-27 00:00:00 PEP 2451.982 50.0 46.0 48.0 0.78
775905 2024-03-27 01:00:00 PEP 2363.249 50.0 46.0 48.0 0.78
775910 2024-03-27 02:00:00 PEP 2337.508 50.0 46.0 48.0 0.78
775912 2024-03-27 03:00:00 PEP 2339.010 50.0 46.0 48.0 0.78
775920 2024-03-27 04:00:00 PEP 2426.600 50.0 46.0 48.0 0.78
775922 2024-03-27 05:00:00 PEP 2626.287 50.0 46.0 48.0 0.78
775930 2024-03-27 06:00:00 PEP 2933.378 50.0 46.0 48.0 0.78
775932 2024-03-27 07:00:00 PEP 3149.023 50.0 46.0 48.0 0.78
775940 2024-03-27 08:00:00 PEP 3254.780 50.0 46.0 48.0 0.78
775942 2024-03-27 09:00:00 PEP 3250.445 50.0 46.0 48.0 0.78
775949 2024-03-27 10:00:00 PEP 3259.116 50.0 46.0 48.0 0.78
775955 2024-03-27 11:00:00 PEP 3284.213 50.0 46.0 48.0 0.78
775959 2024-03-27 12:00:00 PEP 3279.457 50.0 46.0 48.0 0.78
775965 2024-03-27 13:00:00 PEP 3307.597 50.0 46.0 48.0 0.78
775967 2024-03-27 14:00:00 PEP 3300.685 50.0 46.0 48.0 0.78
775975 2024-03-27 15:00:00 PEP 3211.159 50.0 46.0 48.0 0.78
775976 2024-03-27 16:00:00 PEP 3205.710 50.0 46.0 48.0 0.78
775984 2024-03-27 17:00:00 PEP 3216.504 50.0 46.0 48.0 0.78
775990 2024-03-27 18:00:00 PEP 3189.807 50.0 46.0 48.0 0.78
775991 2024-03-27 19:00:00 PEP 3202.078 50.0 46.0 48.0 0.78
776000 2024-03-27 20:00:00 PEP 3156.612 50.0 46.0 48.0 0.78
776005 2024-03-27 21:00:00 PEP 3035.994 50.0 46.0 48.0 0.78
776009 2024-03-27 22:00:00 PEP 2861.596 50.0 46.0 48.0 0.78
776015 2024-03-27 23:00:00 PEP 2698.059 50.0 46.0 48.0 0.78

As we can see, after joining the fill-forward method was used to fill in each hour of the same date with the same weather data, until the next day.

Back to Table of Contents¶

4. Feature Engineering ¶


Many of the upcoming models will benefit by adding additional features as exogenous covariates.

Specific intervals of features such as lag, rolling mean, derivatives need to be considered carefully when used as exogenous features calculated from the target variable. This can be a great source of data leakage if these are not data you would have in a prediction setting.

  • One method to overcome this is to take a naive approach and assign last season's values.
  • Another is create a hybrid model which forecasts the exogenous features to include in the forecast.
  • Finally, a rolling forecast can predict new datapoints and retrain the model with those up until the intended forecast horizon is achieved.

For now, the majority of features and especially the shorter length interval features will be excluded to prevent data leakage, but will still remain implemented below (commented out as appropriate) as an exercise in time series feature engineering.

4.1. Date Features ¶

Even though the data is a timeseries with a datetime feature, it is important to extract that data into separate categories as there is certainly multiple seasonalities within this dataset.

For example, each day the load oscillates as people move throughout the day, but also each season likely has its own effect. Separating these features is an important step in capturing the different seasonal periods in time series forecasting.

Hour, Day, Month, Year, and Day of Year (1-365) will all be extracted.

In [311]:
load_df['Hour'] = load_df['Date'].dt.hour
load_df['Day'] = load_df['Date'].dt.day_of_week
day_map = {0:'Monday', 1:'Tuesday', 2:'Wednesday', 3:'Thursday', 4:'Friday', 5:'Saturday', 6:'Sunday'}
load_df['Month'] = load_df['Date'].dt.month
month_map = {1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'Jun', 7:'Jul', 8:'Aug', 9:'Sep', 10:'Oct', 11:'Nov', 12:'Dec'}
load_df['Year'] = load_df['Date'].dt.year
load_df['Day_of_Year'] = load_df['Date'].dt.day_of_year
load_df.head(3)
Out[311]:
Date zone mw MaxTemperature MinTemperature AvgTemperature Precipitation Hour Day Month Year Day_of_Year
371136 2015-01-01 00:00:00 AE 1180.082 43.0 17.0 30.0 0.0 0 3 1 2015 1
371143 2015-01-01 01:00:00 AE 1129.988 43.0 17.0 30.0 0.0 1 3 1 2015 1
371149 2015-01-01 02:00:00 AE 1088.828 43.0 17.0 30.0 0.0 2 3 1 2015 1
Back to Table of Contents¶

4.2. Rolling Mean ¶

Rolling means are a very effecting feature in time series forecasting by smoothing the signal, helping models find real trends. The window iterates through the dataset by first initializing a window size of data points which are either centered or behind, and then the average is taken of that window.

Two rolling means will be added.

  • Window of 3 data points.
  • Window of 6 data points.
In [312]:
for zone in zones:
    mask = load_df['zone'] == zone
    temp_df = load_df.loc[mask]
    # # 3 Hour Window
    # load_df.loc[mask, 'Rolling_Mean_3'] = load_df[mask]['mw'].rolling(3).mean()
    # # 6 Hour Window
    # load_df.loc[mask, 'Rolling_Mean_6'] = load_df[mask]['mw'].rolling(6).mean()
    # 1 Month Window
    load_df.loc[mask, 'Rolling_Mean_1M'] = load_df[mask]['mw'].rolling(24*30).mean()
Back to Table of Contents¶

4.3. Lagged Features ¶

Lag of the target variable, MW, at different intervals. Lagged features are effective when correlation between the target variable at one time is correlated to a time a certain "lagged" distance in the past. This helps identify trends that have seasonality periods equal to the chosen lag value.

For example, a lag 24 of MW would be the value of MW 24 hours in the past.

Lags of 1, 6, 24, 24 7 (1 Week), 24 30 (1 Month) will be included.

All the way up to a year would be useful, however, it creates rows with empty NaN values the same size of the lag because it has to start that far into the future. These rows will have to be removed as some of the models cannot handle NaNs/no values.

Note: Autocorrelation Functions (ACF) and Partial Autocorrelation Functions (PACF) are great tools to plot and identify which lags have significant correlation in the data. These will be analyzed later on in the SARIMAX model section and might alter what is done in this cell.

In [313]:
# Lag Features
# Create mask for each zone since shift uses index and
# needs to reset shift at each zone.

for zone in zones:
    mask = load_df['zone'] == zone
    temp_df = load_df.loc[mask]
    # # 1 Hour
    # load_df.loc[mask, 'Lag_Hour'] = load_df[mask]['mw'].shift(1)
    # # 6 Hours
    # load_df.loc[mask, 'Lag_6_Hours'] = load_df[mask]['mw'].shift(6)
    # # 24 Hours
    # load_df.loc[mask, 'Lag_Day'] = load_df[mask]['mw'].shift(24)
    # # 1 Week
    # load_df.loc[mask, 'Lag_Week'] = load_df[mask]['mw'].shift(24*7)
    # 1 Month
    load_df.loc[mask, 'Lag_Month'] = load_df[mask]['mw'].shift(24*30)

display(load_df.head(2))
display(load_df.tail(2))
Date zone mw MaxTemperature MinTemperature AvgTemperature Precipitation Hour Day Month Year Day_of_Year Rolling_Mean_1M Lag_Month
371136 2015-01-01 00:00:00 AE 1180.082 43.0 17.0 30.0 0.0 0 3 1 2015 1 NaN NaN
371143 2015-01-01 01:00:00 AE 1129.988 43.0 17.0 30.0 0.0 1 3 1 2015 1 NaN NaN
Date zone mw MaxTemperature MinTemperature AvgTemperature Precipitation Hour Day Month Year Day_of_Year Rolling_Mean_1M Lag_Month
776009 2024-03-27 22:00:00 PEP 2861.596 50.0 46.0 48.0 0.78 22 2 3 2024 87 2831.742881 2787.267
776015 2024-03-27 23:00:00 PEP 2698.059 50.0 46.0 48.0 0.78 23 2 3 2024 87 2831.877457 2601.164
Back to Table of Contents¶

4.4. Fourier Transform ¶

Fourier transforms are incredibly powerful at decomposing signals, in this instance helping identify and separate seasonal periodicity.

By taking a Fast Fourier Transform (FFT) we can then calculate the spectral density and analyze the frequencies that are important in our data. This information alone is useful as a feature and can be used directly as it will give seasonal periods high magnitude values at their peaks, but also will guide some other aspects of this project by identifying all potential seasonality periods.

Note this is a huge subject so it won't be tackled here but fourier transform terms can also be used to extend some models that are only designed to capture one seasonal period into capturing multiples.

In [314]:
for zone in zones:
    mask = load_df['zone'] == zone
    temp_df = load_df.loc[mask]
    # Perform fast fourier transform and take abs() to get spectral density.
    load_df.loc[mask, 'FFT'] = np.abs(fft(temp_df['mw'].values)) ** 2
    # Assign sample rate and find frequencies for plot.
    # d is sample rate. 1 = samples / hour.
    load_df.loc[mask, 'FFT_Freq'] = np.fft.fftfreq(temp_df['mw'].size, d = 1)

load_df.tail(2)
Out[314]:
Date zone mw MaxTemperature MinTemperature AvgTemperature Precipitation Hour Day Month Year Day_of_Year Rolling_Mean_1M Lag_Month FFT FFT_Freq
776009 2024-03-27 22:00:00 PEP 2861.596 50.0 46.0 48.0 0.78 22 2 3 2024 87 2831.742881 2787.267 8.072594e+13 -0.000007
776015 2024-03-27 23:00:00 PEP 2698.059 50.0 46.0 48.0 0.78 23 2 3 2024 87 2831.877457 2601.164 1.734200e+15 -0.000004

You can use the frequency axis to find the corresponding periods by taking the reciprocal of the frequency (period = 1 / frequency) which was assigned above in np.fft.fftfreq() with d =.

For example, if you find a peak at frequency f, the corresponding period would be 1/f. This will be used to identify seasonal periods, but for better interpretability f will be multiplied by 24 to change the sample unit to days.

In [315]:
for zone in zones:
    plt.subplots(figsize = (12,2))
    sns.lineplot(load_df.loc[load_df['zone'] == zone], y = 'FFT', x = 'FFT_Freq', c = zones_palette[zone])
    # Assign y limits based on zone to better visualize.
    if zone in ['AE', 'JC']:
        plt.ylim(bottom = 0, top = 10000000000000)
    else:
        plt.ylim(bottom = 0, top = 1000000000000000)
    # plt.xlim(left = 0) # Negatives are redundant.
    plt.xlabel('Frequency (Hz) - Sample / Hour')
    plt.ylabel('FFT Power Spectral Density')
    plt.title(f'{zone} - Fourier Transform | Cycles')

    # List top repeating cycles.
    # Find top FFT values.
    top_freq = load_df.loc[load_df['zone'] == zone]['FFT'].sort_values(ascending = False).index[1:13:2]
    # Transform frequency to period of 1 day for interpretability
    top_cycle_days = list(np.round(1 / (np.abs(load_df.loc[top_freq]['FFT_Freq'].astype(float))*24)))
    text_box = ['\n' + (str(x)) for x in top_cycle_days]
    plt.text(x = 0.5, y = -0.02, 
             s= f"Peak Frequencies \n    (Hz to Days) {' '.join(text_box)}", 
             transform = ax.transAxes,
             bbox = dict(facecolor = zones_palette[zone], edgecolor = 'black', boxstyle = 'round', alpha = 0.5))

    plt.show()

Scales were adjusted on this graph heavily to get a better picture and may vary between zones even as they have different magnitudes.

We can see most of the different zones share peak magnitude frequencies at approximately:

  • ~182 Days (Roughly 6 Months)
  • ~365 Days (1 Year)
  • ~1 Day

These represent the strongest seasonal periods in the time series data.

Some values are duplicated because they are adjacent frequencies that were rounded.

Back to Table of Contents¶

4.5. First and Second-Order Derivatives/Differencing ¶

Sometimes called discrete derivatives or differencing, this method is analogous to derivatives but using discrete time series data. You often see this used as a standard metric financial/economic settings.

  • First-Order: Gives the instantaneous rate of change between the samples, identifying both the direction and magnitude.
  • Second-Order: Identifies the acceleration of the rate of change which can be useful in finding inflection points in time series data.
In [316]:
for zone in zones:
    mask = load_df['zone'] == zone
    temp_df = load_df.loc[mask]
    # 1st Order
    load_df.loc[mask, 'Deriv_1'] = load_df[mask]['mw'].diff()
    # 2nd Order
    load_df.loc[mask, 'Deriv_2'] = load_df[mask]['mw'].diff().diff()
In [317]:
for zone in zones:
    fig, ax = plt.subplots(3,1, figsize = (12,4), sharey = False, sharex = True)
    sns.lineplot(load_df.loc[load_df['zone'] == zone], ax = ax[0], y = 'mw', x = 'Date', c = zones_palette[zone], alpha = 0.8)
    sns.lineplot(load_df.loc[load_df['zone'] == zone], ax = ax[1], y = 'Deriv_1', x = 'Date', c = 'black', alpha = 0.8)
    sns.lineplot(load_df.loc[load_df['zone'] == zone], ax = ax[2], y = 'Deriv_2', x = 'Date', c = 'orange', alpha = 0.8)
    fig.suptitle(f"{zone} - Derivatives")
    fig.subplots_adjust(hspace = 0.1)
In [318]:
# Remove from feature list, to prevent leakage.
# Until another solution is implemented.
load_df = load_df.drop(['Deriv_1', 'Deriv_2'], axis = 1)
Back to Table of Contents¶

5. Exploratory Data Analysis (EDA) ¶


In [319]:
def plot_daterange(date_min, date_max, zone_select):
    """Plots MW vs Date in given date range and zone. Interval starts and ends at midnight.
    Note: Some of the text positions break down with larger ranges.
    Ideal range is two-weeks.

    Parameters:
        date_min (string): Date of beginning of interval (inclusive).
        date_max (string): Date of end of interval (inclusive).
        zone_select (string): Zone from load_zone_df to plot.
    
    Returns:
        None. Prints plot to output."""
    
    date_min = pd.to_datetime(f'{date_min} 00:00:00')
    date_max = pd.to_datetime(f'{date_max} 00:00:00')
    date_range = (date_max - date_min).days
    weather_temp = weather_df.loc[(weather_df['Date'].between(date_min, date_max, inclusive='both')) & 
                                (weather_df['zone'] == zone_select)].reset_index(drop = True)

    
    plt.subplots(figsize = (14, 8))
    sns.lineplot(load_zone_df.loc[(load_zone_df['Date'].between(date_min, date_max, inclusive='both')) & 
                                    (load_zone_df['zone'] == zone_select)], 
                    x = 'Date', 
                    y = 'mw',
                    c = zones_palette[zone_select],  
                    alpha = 0.7, lw = 1, marker = 'o')

    # Iterates through each day in range.
    for i in range(date_range + 1):
        # Day separating line.
        plt.axvline(date_min + pd.Timedelta(days = i), color = 'purple', ls = '--')
        if i > date_range-1:
            continue
        # Day of the week.
        plt.text(x = date_min + pd.Timedelta(days = i) + pd.Timedelta(hours = 10),
                    y = (plt.ylim()[0] + plt.ylim()[1]) / 2,
                    s = f'{(date_min + pd.Timedelta(days = i)).day_name()}',
                    rotation = 'vertical',
                    weight='bold',
                    alpha = 0.75)
        # Temperatures.
        # High
        plt.text(x = date_min + pd.Timedelta(days = i) + pd.Timedelta(hours = 6),
                    y = ((plt.ylim()[0] + plt.ylim()[1]) / 2) + ((plt.ylim()[0] + plt.ylim()[1]) / 8),
                    s = f"Hi:{weather_temp['MaxTemperature'].iloc[i]}\N{DEGREE SIGN}",
                    rotation = 'horizontal',
                    alpha = 0.75)
        # Low
        plt.text(x = date_min + pd.Timedelta(days = i) + pd.Timedelta(hours = 6),
                    y = ((plt.ylim()[0] + plt.ylim()[1]) / 2) - ((plt.ylim()[0] + plt.ylim()[1]) / 8),
                    s = f"Lo:{weather_temp['MinTemperature'].iloc[i]}\N{DEGREE SIGN}",
                    rotation = 'horizontal',
                    alpha = 0.75)
    return None
In [320]:
plot_daterange('8/01/2023', '8/16/2023', 'DOM')

This is about as expected, load is lowest at night and rises throughout the day. Data is from August on the East Coast of the US which tends to be their hottest month.

Things to note:

  • We can see a likely positive correlation between temperature and load rising in tandem.
  • I would assume weekends have lower energy loads but it is unclear on this graph, likely due to the temperature's effect.
In [321]:
plot_daterange('01/07/2024', '01/22/2024', 'CE')

In contrast to the last plot, this is data from Chicago in one of the coldest months, January.

  • Saturday and Sunday do tend to be lower than the weekdays.
  • Temperature once again is showing itself to be a very important covariate, making load rise on 2024-1-14 when high temperatures drop from 33 to 3.
In [322]:
load_day_df = load_zone_df.copy(deep = True)
load_day_df['Date'] = pd.to_datetime(load_day_df['Date'].dt.date)
load_day_df = load_day_df.groupby(['Date', 'zone'], as_index = False).sum()

#load_day_df['mw_norm'] = float
for zone in zones:
    mask = load_day_df['zone'] == zone
    temp_df = load_day_df.loc[mask]
    min_mw = temp_df['mw'].min()
    max_mw = temp_df['mw'].max()
    load_day_df.loc[mask, 'mw_norm'] = (temp_df['mw'] - min_mw) / max_mw

palette = ['#570408', '#005d5d', '#1192e8', '#fa4d56', '#012749']
order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
plt.subplots(figsize = (12, 6))
sns.boxplot(load_day_df, 
                x = (load_day_df['Date'].dt.day_name()), 
                y = load_day_df['mw_norm'],
                hue = 'zone',
                order = order,
                palette = sns.color_palette(palette, 5),
                saturation = 0.65)
plt.grid(alpha = 0.5)

Saturday and Sunday do seem to be lower but not by a significant amount.

In [323]:
# Month
plt.subplots(figsize = (14, 6))
sns.boxplot(load_day_df, 
                x = (load_day_df['Date'].dt.month_name()), 
                y = load_day_df['mw_norm'],
                hue = 'zone',
                palette = sns.color_palette(palette, 5),
                saturation = 0.65)
plt.grid(alpha = 0.5)

Summer months not only have the most variance but also highest load demands.

Here, a histogram plot of MW of all zones will be made.

In [324]:
fig, ax = plt.subplots(1, 5, figsize=(12, 4), sharey=True)
for i, zone in enumerate(zones):
    temp_df = load_df.query("zone == @zone")['mw']
    sns.histplot(temp_df, ax = ax[i], color = zones_palette[zone])
    ax[i].axvline(temp_df.median(), c = 'black', alpha = 0.6)
    ax[i].axvline(temp_df.mean(), c = 'red', alpha = 0.6)
    ax[i].set_title(zone)
fig.suptitle('Histogram of MW')
fig.tight_layout()
  • Median in Black
  • Mean in Red
In [325]:
fig, ax = plt.subplots(1, 5, figsize=(12, 4), sharey=True)
for i, zone in enumerate(zones):
    temp_df = load_df.query("zone == @zone")['mw']
    sns.histplot(temp_df, ax = ax[i], color = zones_palette[zone])
    
    #ax[i].axvline(temp_df.mean(), c = 'red', alpha = 0.6)
    quant = load_df.query("zone == @zone")['mw'].quantile(np.linspace(0,1,5))
    for quantile in quant:
        ax[i].axvline(quantile, c = 'purple', alpha = 0.6, ls = '--')
    ax[i].set_title(zone)
    ax[i].axvline(temp_df.median(), c = 'black', alpha = 0.7)
    ax[i].axvspan(ax[i].get_xlim()[0], quant.iloc[0], facecolor='#b35d24', alpha = 0.4)
    ax[i].axvspan(quant.iloc[-1], ax[i].get_xlim()[1], facecolor='#b35d24', alpha = 0.4)
    ax[i].margins(x = 0)


fig.suptitle('Histogram of MW')
fig.tight_layout()

See the beginning of section 7.2. for a few time series specific analyses.

Back to Table of Contents¶

6. Train, Validation, and Test Split¶


To avoid data leakage the test set will always be assigned the most recent dates with zero overlap in train or validation sets.

Since each zone has different beginning dates but share the same end dates this can be used to split each zone in the same way.

Important Note: In a decision-making setting, this split should have occurred before the EDA section to avoid data-leakage/snooping. This was not done in this case as this is just an exercise.

In [326]:
# Remove any NaN rows as some models can't handle them.
load_df = load_df[~load_df.isnull().any(axis = 1)]
load_df.reset_index(drop = True, inplace = True)

# Train - Min up to 2021
# Val - 2021 up to 2023
# Test - 2023 to Max
train_cutoff = '1/1/2019'
test_cutoff = '1/1/2022'

train = {}
val = {}
test = {}
for zone in zones:
    train[zone] = []
    val[zone] = []
    test[zone] = []
    train[zone] = load_df.loc[(load_df['zone'] == zone) & 
                              (load_df['Date'] < pd.to_datetime(train_cutoff))]

    val[zone] = load_df.loc[(load_df['zone'] == zone) & 
                            (load_df['Date'] >= pd.to_datetime(train_cutoff)) & 
                            (load_df['Date'] < pd.to_datetime(test_cutoff))]
    
    test[zone] = load_df.loc[(load_df['zone'] == zone) & 
                             (load_df['Date'] >= pd.to_datetime(test_cutoff))]
    
    # Drop rows with NaN created from feature engineering.
    train[zone] = train[zone][~train[zone].isnull().any(axis = 1)]

Because of the way a few models work, the leap year in the validation set was causing issues. It will be removed here for now until another solution is worked on. Either way it shouldn't have much importance to the results.

In [327]:
# Leap year in validation causing issues in seasonal naive model.
# Removing for now.
for zone in zones:
    leap_year_idx = val[zone].loc[val[zone]['Date'].between('2020-02-29 00:00:00',
                                                            '2020-02-29 23:00:00', 
                                                            inclusive = 'both')].index
    val[zone] = val[zone].drop(leap_year_idx)

Sets will be placed into dictionaries keyed by zone name, so models can iterate through each zone and fit/predict separately.

In [328]:
X_train, X_val, X_test = {}, {}, {}
y_train, y_val, y_test = {}, {}, {}

for zone in zones:
        X_train[zone], X_val[zone], X_test[zone] = [], [], []
        y_train[zone], y_val[zone], y_test[zone] = [], [], []
        # Train
        X_train[zone] = train[zone].set_index('Date')
        y_train[zone] = X_train[zone]['mw']
        X_train[zone] = X_train[zone].iloc[:, 2:]
        # Validation
        X_val[zone] = val[zone].set_index('Date')
        y_val[zone] = X_val[zone]['mw']
        X_val[zone] = X_val[zone].iloc[:, 2:]
        # Test
        X_test[zone] = test[zone].set_index('Date')
        y_test[zone] = X_test[zone]['mw']
        X_test[zone] = X_test[zone].iloc[:, 2:]

Now, plot the split regions for reference.

Note:

  • This might be revisited in the future to prune some of the training data of the longer existing zones for better comparability between zones.
  • The location of the COVID lockdown period might be an outlier to note in this dataset as it might affect model performance depending on which region it ends up in.
    • We could add a static categorical feature denoting a hot-encoding of holidays and perhaps COVID to account for this.
In [329]:
def set_region_overlay(set_label, start_date, end_date, x_offset):
    mid_date_train = (list(pd.date_range(start = pd.to_datetime(start_date), end = pd.to_datetime(end_date))))
    plt.text(x = mid_date_train[int(len(mid_date_train) / 2) + x_offset],
                y = (plt.ylim()[0] + plt.ylim()[1]) / 2,
                s = set_label,
                rotation = 'horizontal',
                weight = 'extra bold',
                fontsize = 'large',
                antialiased = True,
                alpha = 1,
                c = 'white',
                bbox = dict(facecolor = 'black', edgecolor = 'black', boxstyle = 'round', alpha = 0.6))
    return None

plt.subplots(figsize = (14,4))
sns.lineplot(load_day_df, x = 'Date', y = 'mw_norm', hue = 'zone', palette=sns.color_palette(palette, 5), lw = 0.7, alpha = 0.7)
plt.margins(x = 0)
plt.axvline(pd.to_datetime(train_cutoff), color = 'purple', ls = '--')
plt.axvline(pd.to_datetime(test_cutoff), color = 'purple', ls = '--')

plt.axvspan(load_day_df['Date'].min(), pd.to_datetime(train_cutoff), facecolor='#2b0054', alpha = 0.4)
plt.axvspan(pd.to_datetime(train_cutoff), pd.to_datetime(test_cutoff), facecolor='#004a54', alpha = 0.4)
plt.axvspan(pd.to_datetime(test_cutoff), load_day_df['Date'].max(), facecolor='#5e3904', alpha = 0.4)

set_region_overlay('T R A I N', load_day_df['Date'].min(), train_cutoff, 0)
set_region_overlay('V A L', train_cutoff, test_cutoff, -225)
set_region_overlay('T E S T', train_cutoff, load_day_df['Date'].max(), 225)

plt.legend().remove()
plt.show()

6.1. Validation Set Conundrum ¶

An important question arises when using a validation set to tune your models in a time series domain because of the temporal nature of the data. Oftentimes, but not always, the most recent observations have the biggest impact on future forecasted predictions. This can be an issue in the process of hyperparameter optimization because the fit model now contains a time-gap between the training set and the testing set.

A choice remains:

  1. Use the validation set only in model selection and hyperparameter tuning, hoping it generalizes well to new data.
  2. After using the validation set to determine your model and hyperparameters, refit the model including the validation set, hoping this will not overfit the now combined training and validation sets.

In this case, choice 2 will be the chosen path, but both have their merits and are highly dependant on the dataset and target value.

Back to Table of Contents¶

7. Model Setup ¶


In [330]:
# Creating this as a global for now.
horizon_selector = {'6H' : pd.Timedelta(hours = 6-1),
            '3D' : pd.Timedelta(hours = (24*3)-1),
            '1W' : pd.Timedelta(hours = (24*7)-1),
            '1M' : pd.Timedelta(hours = (24*30)-1),
            '6M' : pd.Timedelta(hours = (24*181)-1),
            '2Y' : pd.Timedelta(hours = (24*365*2)-1)}
In [331]:
def generate_test_horizon(zone, forecast_horizon, X_y):
    """Generates slice of test set according to input zone, forecast window, and either X or y - always starting at the minimum date [ :forecast_horizon].
    
    Parameters:
        zone (string): String of zone data to retrieve.
        forecast_horizon (string): 6H, 3D, 1W, 1M, 6M, 2Y which will slice test up to that amount of time.
        X_y (string): X or y to choose which to generate.
    Returns:
        Numpy Array of test[zone] interval up to the forecast window."""

    if forecast_horizon not in horizon_selector:
        raise Exception('That forecast window is not supported, please choose one of the following: 6H, 3D, 1W, 1M, 6M, 2Y.')

    date_min = test[zone]['Date'].min()
    if X_y == 'X':
        _test = test[zone].loc[test[zone]['Date'].between(date_min, 
                                                    date_min + horizon_selector[forecast_horizon], 
                                                    inclusive = 'both')].drop(['zone', 'Date', 'mw'], axis = 1)
    if X_y == 'y':
        _test = test[zone].loc[test[zone]['Date'].between(date_min, 
                                                    date_min + horizon_selector[forecast_horizon], 
                                                    inclusive = 'both')]['mw'].to_numpy()
    return _test

7.1. Baseline Models ¶

A baseline model is needed to compare all other models to. Here we will initialize two different approaches.

  1. Seasonal Naive Forecast Model - Like the name implies it is a method to naively handle time series with seasonality(ies).
  2. Mean Model - Very simple but as long as there isn't drift anticipated in future data, can score well.

Note: The time series data used here has little to no trend/drift (increasing or decreasing over time), but it would be useful to add a baseline that could handle it if we did. A baseline that captures and forecasts the slope of that trend could be used analogous to the mean model here.

7.1.1. Seasonal Naive Forecasting ¶

The Seasonal Naive Forecasting Model is a simple but often effective baseline model. It will be setup here for different forecast horizons from which all other models will be measured against.

This model forecasts future values to be equal to past values of the target variable (MW) shifted backwards one period of the input season. As we predict different forecast horizons, the length of the horizon can be used as a seasonal-lag to adjust the naive forecast lengths.

For example:

If we want to forecast a week into the future, the previously observed values from the same week last year will be used as the forecast.

Note:

Since the validation set will not be necessary in these baseline models, they are being utilized here instead of train since the validation set contains the last observed values of the last seasonal period.

In [332]:
# TODO: Generalize this section better. Handle all horizons and generalize y_pred method.
# Find date for each forecast window (corresponding date in validate forecast_window)
naive_date = {'1H' : (val[zone]['Date'].max() + pd.Timedelta(hours = 1)) - pd.Timedelta(days = 365),
              '6H' : (val[zone]['Date'].max() + pd.Timedelta(hours = 6)) - pd.Timedelta(days = 365),
              '3D' : (val[zone]['Date'].max() + pd.Timedelta(hours = 24*3)) - pd.Timedelta(days = 365),
              '1W' : (val[zone]['Date'].max() + pd.Timedelta(hours = 24*7)) - pd.Timedelta(days = 365),
              '1M' : (val[zone]['Date'].max() + pd.Timedelta(hours = 24*30)) - pd.Timedelta(days = 365),
              '6M' : (val[zone]['Date'].max() + pd.Timedelta(hours = 24*181)) - pd.Timedelta(days = 365),
              '1Y' : (val[zone]['Date'].max()),
              '2Y' : (val[zone]['Date'].max())}

# Initialize prediction dictionary - {zone : {horizon : y_pred}}
naive = {}

for zone in zones:
    naive[zone] = {}
    for horizon in horizon_selector:
        naive[zone][horizon] = []
        # 1H and 2Y have different methods currently.
        if horizon == '1H':
            naive[zone][horizon] = val[zone].loc[val[zone]['Date'] == naive_date[horizon]]['mw'].to_numpy()
            continue
        if horizon == '2Y':
            naive[zone][horizon] = val[zone].loc[val[zone]['Date'].between(naive_date['1H'] - pd.Timedelta(days = 366),
                                                                            naive_date[horizon], inclusive = 'both')]['mw'].to_numpy()
            continue
        # Create naive predictions (y_pred) of 1 Hour, 3 Day, 1 Week, 1 Month, 6 Months, 1 Year, and 2 Years forecast windows.
        naive[zone][horizon] = val[zone].loc[val[zone]['Date'].between(naive_date['1H'], 
                                                                       naive_date[horizon], inclusive = 'both')]['mw'].to_numpy()
    # Create 1Y for example plot later. Only used for that.
    naive[zone]['1Y'] = val[zone].loc[val[zone]['Date'].between(naive_date['1H'], 
                                                                       naive_date[horizon], inclusive = 'both')]['mw'].to_numpy()
In [333]:
display(naive['CE']['6H'])
array([9988.853, 9715.127, 9443.542, 9263.926, 9183.058, 9194.837])
In [334]:
print(len(naive['AE']['6M'])/(365*24))
print(len(naive['AE']['2Y'])/(365*24))
0.4958904109589041
2.0
In [335]:
plt.subplots(figsize=(10,4))
sns.lineplot(val['CE'], x = 'Date', y = 'mw', color = zones_palette['CE'], alpha = 0.8)
sns.lineplot(x = pd.date_range(test['CE']['Date'].min(), periods = 24*365, freq='H'), y = naive['CE']['1Y'], color = 'purple', alpha = 1)
sns.lineplot(x = pd.date_range(test['CE']['Date'].min(), periods = 24*365, freq='H'), y = naive['CE']['1Y'], color = 'purple', alpha = 1)
sns.lineplot(val['CE'].iloc[-365*24:], x = 'Date', y = 'mw', color = 'purple', alpha = 0.4)
plt.axvline(val['CE']['Date'].max(), color = 'black', ls = '--')
plt.axvline(val['CE']['Date'].iloc[-365*24], color = 'black', ls = '--', alpha = 0.5)
plt.xlim(val['CE']['Date'].max() - pd.Timedelta(days=365*3), val['CE']['Date'].max() + pd.Timedelta(days=365))
plt.axvspan(val['CE']['Date'].max(), val['CE']['Date'].max() + pd.Timedelta(days=365), color = 'purple', alpha = 0.2)
plt.title('Seasonal Naive Forecast Example')
plt.show()
Back to Table of Contents¶

7.1.2. Mean ¶

A very simple baseline model can be had by taking the mean of the historical time series data and using that mean to predict all future values.

In [336]:
mean = {}
for zone in zones:
    mean[zone] = {}
    for horizon in horizon_selector:
        mean[zone][horizon] = []
        y_test_horizon = generate_test_horizon(zone, horizon, 'y')
        y_mean = load_df.loc[(load_df['zone'] == zone) & (load_df['Date'] < test_cutoff)]['mw'].mean()
        y_pred = np.full_like(y_test_horizon, fill_value = y_mean)
        mean[zone][horizon] = y_pred
In [337]:
mean['PEP']['6H']
Out[337]:
array([3427.67975695, 3427.67975695, 3427.67975695, 3427.67975695,
       3427.67975695, 3427.67975695])
In [338]:
plt.subplots(figsize=(10,4))
sns.lineplot(train['PEP'], x = 'Date', y = 'mw', color = zones_palette['PEP'], alpha = 0.8)
plt.axhline(train['PEP']['mw'].mean(), color = 'purple')
plt.axvline(train['PEP']['Date'].max(), color = 'black', ls = '--')
plt.xlim(train['PEP']['Date'].max() - pd.Timedelta(days=365*3), train['PEP']['Date'].max() + pd.Timedelta(days=365))
plt.axvspan(train['PEP']['Date'].max(), train['PEP']['Date'].max() + pd.Timedelta(days=365), color = 'purple', alpha = 0.2)
plt.title('Mean Model Forecast Example')
Out[338]:
Text(0.5, 1.0, 'Mean Model Forecast Example')
Back to Table of Contents¶

7.2. Statistical Models ¶

Before fitting any type of AutoRegressive Integrated Moving Average (ARIMA) model, the time series data needs to be checked if it is stationary, or has a time-dependent structure. Then, if the data is indeed stationary, the next step is to analyze the Autocorrelation Function (ACF) and Partial Autocorrelation Functions (PACF) to help identify seasonal parameters for the model setup.

The plots in previous sections of the data were already a decent indication of the data's stationarity (also the fourier transform should confirm the upcoming ACF and PACF plots), but a statistical test can be performed to help confirm stationarity.

Augmented Dickey-Fuller Test:

  • Null Hypothesis $H_0$: Time series has a unit root, meaning it is non-stationary - data has time dependent structure.
  • Alternate Hypothesis $H_1$: Time series does not have a unit root, meaning it is stationary - does not have time-dependent structure.
In [62]:
for zone in zones:
    stationarity = (adfuller(y_train[zone]))
    # Check that the test statistic is less than the 5% confidence interval and
    # the p-value is less than 0.05.
    if (stationarity[1] < 0.05) & (stationarity[4]['5%'] > stationarity[0]):
        print(f"{zone} | Stationary! With a p-value of {stationarity[1]}, we reject the null hypothesis of Augmented Dickey-Fuller test.")
    else:
        print(f"{zone} | Not Stationary! With a p-value of {stationarity[1]}, we fail to reject the null hypothesis of Augmented Dickey-Fuller test.")
AE | Stationary! With a p-value of 1.1607469326048218e-12, we reject the null hypothesis of Augmented Dickey-Fuller test.
CE | Stationary! With a p-value of 0.0, we reject the null hypothesis of Augmented Dickey-Fuller test.
DOM | Stationary! With a p-value of 5.989020628296604e-29, we reject the null hypothesis of Augmented Dickey-Fuller test.
JC | Stationary! With a p-value of 1.3026665225491982e-18, we reject the null hypothesis of Augmented Dickey-Fuller test.
PEP | Stationary! With a p-value of 0.0, we reject the null hypothesis of Augmented Dickey-Fuller test.

Now that it has been proven the time series data is stationary, meaning no transformations need to be performed and the ACF and PACF plots can be used.

In [63]:
# Plot 2 Years
plot_acf(y_train['AE'], lags = 24*365*2, auto_ylims = True) 
# Plot 1 Day
plot_acf(y_train['AE'], lags = 30, auto_ylims = True)
plt.show()
In [64]:
# Identifying the peaks and valleys outside the confidence intervals.
# Divide by 24 hours to find the days.
print(2450/24, 'Days')
print(8700/24, 'Days')
102.08333333333333 Days
362.5 Days
In [65]:
plot_pacf(y_train['AE'], lags = 30)
plt.show()

Analyzing the ACF and PACF Plots:

  • Unsurprisingly there are complex significant oscillations in both plots indicating a SARIMA is a more appropriate model than AR, MA, ARIMA.
  • There are many autocorrelations that are significantly non-zero - the time series is non-random.
  • Large autocorrelation between lag 1 and its neighbors.
  • Seasonal periods can be seen at around 100 days and 365 days.

Potential Issues: Our view of the data in these plots might not be at the correct scale to properly adjust the SARIMA model. This can be adjusted of course however, SARIMA also struggles with large datasets especially with added exogenous covariates. Also, the fact that several seasonal periods can be identified in our data means SARIMA won't find a proper fit as it will always only fit one of the cycles - 24 hour cycle, year cycle, etc.. This also can be fixed by decomposing the signal and using fourier terms but is outside the scope of this project.

7.2.1. Seasonal Autoregressive Integrated Moving Average (SARIMA) ¶

Turned off for now (see below).

In [66]:
# order = (3, 0, 10) # (p-integration, d-AR, q-MA)
# seasonal_order = (0, 0, 0, 24) # (p, d, q, s-seasonal period)
# sarima_model = sm.tsa.SARIMAX(endog = y_train['AE'].asfreq('H'),
#                               # Only FFT for now. 
#                               #exog = X_train['AE']['FFT_Freq'].asfreq('H'), 
#                               order = order, 
#                               seasonal_order = seasonal_order,
#                               enforce_stationarity = False,
#                               enforce_invertibility = False,
#                               simple_differencing = True
#                               ).fit()

# sarima_model.summary()
In [67]:
# # Visualize validation set prediction to tune model.
# plt.subplots(figsize = (12,6))
# sns.lineplot(sarima_model.forecast(35040))#, exog = X_val['AE']['FFT_Freq']))
# sns.lineplot(y_val['AE'], c = zones_palette['AE'], alpha = 0.5)
# # plt.xlim(right = pd.to_datetime('2021-07-01'))
# # plt.ylim(bottom = 0, top = 3000)

The amount of compute it takes to fit this model properly for long forecast horizons exceeds the abilities of my machine, even with fitting and validating on one of the zones with the shortest time period. Unfortunately this means the SARIMAX model will not be completed here, however this exercise does illustrate how effective time series forecasting requires an art of balancing trade-offs.

Potential Solutions:

If the goal was only short forecast horizons, you could get away with much less training data - say 1 week for hourly forecasts or 1 month for week forecasts, etc. However, this removes the effectiveness of some of the features

A solution for long forecast horizons could be to smooth and resample the original data and use SARIMAX on that, only focusing on one of the longer seasonal periods like a year. Though, this method would yield poor performance which wouldn't perform too much better than a rolling average forecasting model when evaluating it back to the real observed target variable.

Back to Table of Contents¶

7.2.2. Holt-Winters' Method ¶

Holt-Winters’ method (Sometimes referred to as triple exponential smoothing) is an extension of exponential smoothing for data that contains both trend and seasonality.

Here you can choose either a model additive or multiplicative error terms and you also need to set a season length. During validation, a season length of 1 week (247) performed well on short forecast horizons but poor at longer horizons. A year (24365) would perform well, but due to lengthy training times, 1 week was chosen.

In [68]:
holtwinters_y_pred = {}

for zone in zones:
    print('Currently Fitting:', zone)
    holtwinters_y_pred[zone] = {}
    # Create train+val for final fit.
    temp_df = pd.concat([train[zone], val[zone]], axis = 0)
    season_length = 24*7 # 1 Week

    # Fit model per zone.
    holtwinters_model = HoltWinters(season_length = season_length, error_type = 'A')
    holtwinters_model = holtwinters_model.fit(y = temp_df['mw'].to_numpy(), X = temp_df.iloc[:, 3:].to_numpy())

    for horizon in horizon_selector:
        holtwinters_y_pred[zone][horizon] = []
        horizon_freq = int((horizon_selector[horizon].total_seconds() / 3600) + 1)
        # Predict per horizon.
        holtwinters_y_pred[zone][horizon] = holtwinters_model.predict(h = horizon_freq, X = test[zone].iloc[:horizon_freq])['mean']

        # y_test_horizon = y_val[zone].iloc[:horizon_freq]
        # print(horizon, np.sqrt(sum((y_test_horizon - holtwinters_y_pred['mean'])**2)/len(y_test_horizon)))

        # plt.subplots(figsize=(10,2))
        # sns.lineplot(x = y_test_horizon.index, y = y_test_horizon)
        # sns.lineplot(x = y_test_horizon.index, y = holtwinters_y_pred['mean'], color = 'purple')
        # plt.show()
Currently Fitting: AE
Currently Fitting: CE
Currently Fitting: DOM
Currently Fitting: JC
Currently Fitting: PEP
Back to Table of Contents¶

7.2.3. Multiple Seasonal-Trend decomposition using LOESS (MSTL) ¶

Note: Currently implemented without exogenous variables. This model has the ability to include them but required too much compute for current hardware during setup validations.

In [69]:
# MSTL across all zones and horizons.
mstl_models = {}
for zone in zones: 
    mstl_models[zone] = {}
    print('Currently Fitting:', zone)
    for horizon in horizon_selector:
        if horizon in ['6M', '2Y']:
            continue # Skip, too long for this model.
        print(f"{'':<5}Forecast Horizon: {horizon}")
        mstl_models[zone][horizon] = []
        # Convert horizon which is in pd.timedelta to hours integer.
        horizon_freq = int((horizon_selector[horizon].total_seconds() / 3600) + 1)
        model = MSTL(season_length = [24, 24*7, 24*182, 24*365], 
                        trend_forecaster = AutoARIMA(prediction_intervals = ConformalIntervals(h = horizon_freq, n_windows = 4)))
        mstl_models[zone][horizon] = model.fit(y = y_val[zone])
Currently Fitting: AE
     Forecast Horizon: 6H
     Forecast Horizon: 3D
     Forecast Horizon: 1W
     Forecast Horizon: 1M
Currently Fitting: CE
     Forecast Horizon: 6H
     Forecast Horizon: 3D
     Forecast Horizon: 1W
     Forecast Horizon: 1M
Currently Fitting: DOM
     Forecast Horizon: 6H
     Forecast Horizon: 3D
     Forecast Horizon: 1W
     Forecast Horizon: 1M
Currently Fitting: JC
     Forecast Horizon: 6H
     Forecast Horizon: 3D
     Forecast Horizon: 1W
     Forecast Horizon: 1M
Currently Fitting: PEP
     Forecast Horizon: 6H
     Forecast Horizon: 3D
     Forecast Horizon: 1W
     Forecast Horizon: 1M
In [70]:
# .predict() each model.
mstl_y_pred = {}
for zone in zones:
    mstl_y_pred[zone] = {}
    for horizon in horizon_selector:
        mstl_y_pred[zone][horizon] = []
        horizon_freq = int((horizon_selector[horizon].total_seconds() / 3600) + 1)
        # Too long for this model.
        if horizon in ['6M', '2Y']:
            continue
        mstl_y_pred[zone][horizon] = mstl_models[zone][horizon].predict(h = horizon_freq, level = [80, 95])
In [71]:
# Plot all zones and horizon predictions.
for zone in zones:
    fig, ax = plt.subplots(4, 1, figsize = (8, 5), sharex = True)
    for i, horizon in enumerate(horizon_selector):
        if horizon in ['6M', '2Y']:
            continue # Skip, too long for this model.
        horizon_freq = int((horizon_selector[horizon].total_seconds() / 3600) + 1)
        plot_df = pd.concat([y_test[zone].iloc[:horizon_freq],
                             pd.Series(mstl_y_pred[zone][horizon]['mean'],
                                       index = y_test[zone].iloc[:horizon_freq].index,
                                       name = 'y_hat')],axis = 1)
        ax[i].plot(plot_df.index, plot_df['mw'], c = zones_palette[zone], alpha = 0.9, label = 'True' if i == 0 else '')
        ax[i].plot(plot_df.index, plot_df['y_hat'], c = 'purple', alpha = 0.9, label = 'Forecast' if i == 0 else '')
        ax[i].fill_between(x = plot_df.index[-horizon_freq: ], 
                        y1 = mstl_y_pred[zone][horizon]['lo-80'][-horizon_freq: ], 
                        y2 = mstl_y_pred[zone][horizon]['hi-80'][-horizon_freq: ],
                        alpha = 0.2, color = '#3fb4fc', label = 'CI - 80' if i == 0 else '')
        ax[i].fill_between(x = plot_df.index[-horizon_freq: ], 
                y1 = mstl_y_pred[zone][horizon]['lo-95'][-horizon_freq: ], 
                y2 = mstl_y_pred[zone][horizon]['hi-95'][-horizon_freq: ],
                alpha = 0.2, color = '#82ccfa', label = 'CI - 95' if i == 0 else '')
        ax[i].grid()
        ax[i].set_title(horizon)
    fig.suptitle(f"{zone} - MSTL | All Forecast Horizons")
    fig.legend(loc = 'upper right', ncol = 2)
    fig.tight_layout()

    plt.show()
In [73]:
# Reformat into y_pred dict to match other models.
mstl_y_pred_rfmt = {}
for zone in zones:
    mstl_y_pred_rfmt[zone] = {}
    for horizon in horizon_selector:
        mstl_y_pred_rfmt[zone][horizon] = []
        # Set missing horizon results to N/A so it shows up in results.
        if horizon in ['6M', '2Y']:
            horizon_freq = int((horizon_selector[horizon].total_seconds() / 3600) + 1)
            mstl_y_pred_rfmt[zone][horizon] = np.full((horizon_freq),'N/A')
            continue
        mstl_y_pred_rfmt[zone][horizon] = mstl_y_pred[zone][horizon]['mean']
Back to Table of Contents¶

7.3. Machine Learning Models ¶

7.3.1. Support Vector Regression ¶

In [74]:
# Centered and Standardized for use in SVMs.
scaler = StandardScaler()
X_train_scaled = {}
X_val_scaled = {}
X_test_scaled = {}

for zone in zones:
    X_train_scaled[zone] = []
    X_val_scaled[zone] = []
    X_test_scaled[zone] = []
    
    X_train_scaled[zone] = pd.DataFrame(scaler.fit_transform(X_train[zone]), columns = scaler.get_feature_names_out())
    X_val_scaled[zone] = pd.DataFrame(scaler.fit_transform(X_val[zone]), columns = scaler.get_feature_names_out())
    X_test_scaled[zone] = pd.DataFrame(scaler.transform(X_test[zone]), columns = scaler.get_feature_names_out())
In [75]:
# # Grid search for best parameters.
# # Utilized and commented out to save rerun time.

# params = {'C' : [750, 900, 1024], 'gamma' : [0.01, 0.1], 'kernel' : ['linear', 'rbf']}
# #ps = PredefinedSplit(test_fold = val_fold)

# grid_svm = GridSearchCV(SVR(), param_grid = params, cv = 2, verbose = 3).fit(X_train_scaled['JC'], y_train['JC'])
# mod_svm = grid_svm.best_estimator_
# print(mod_svm)
# print(mod_svm.score(X_train_scaled['JC'], y_train['JC']))
In [76]:
# SVM on all zones
svm_val_results = {}
svm_models = {}

for zone in zones:
    print('Currently Fitting:', zone)
    svm_val_results[zone] = []
    svm_models[zone] = []
    # Concat train and val sets for final training.
    X_temp_train_df = pd.concat([X_train_scaled[zone], X_val_scaled[zone]]).reset_index(drop = True)
    y_temp_train_df = pd.concat([y_train[zone], y_val[zone]]).reset_index(drop = True)
    # Fit model.
    #svm_model = SVR(C = 32768, gamma = 7.62939453125e-06, kernel = 'linear')
    svm_model = SVR(C = 900, gamma = 0.01, kernel = 'rbf')
    svm_model.fit(X_temp_train_df, y_temp_train_df)

    svm_models[zone] = svm_model
    
Currently Fitting: AE
Currently Fitting: CE
Currently Fitting: DOM
Currently Fitting: JC
Currently Fitting: PEP
In [77]:
svm_y_pred = {}
for zone in zones:
    svm_y_pred[zone] = {}
    for horizon in horizon_selector:
        svm_y_pred[zone][horizon] = []
        horizon_freq = int((horizon_selector[horizon].total_seconds() / 3600) + 1)
        X_test_horizon = X_test_scaled[zone].iloc[:horizon_freq]
        svm_y_pred[zone][horizon] = svm_models[zone].predict(X_test_horizon)
Back to Table of Contents¶

7.3.2. XGBoost ¶

In [78]:
# # Grid search for best parameters.
# # Utilized and commented out to save rerun time.

# params = {'max_depth' : [3,5,7,10], 'gamma' : [0.05,0.1,0.5,5], 'learning_rate': [0.1]}
# grid_xgb = GridSearchCV(xgb.XGBRegressor(booster = 'gbtree',
#                                         n_estimators = 5000,
#                                         early_stopping_rounds = 100,
#                                         objective = 'reg:squarederror',
#                                         eval_metric = 'rmse'),
#                                 param_grid = params,
#                                 cv = 2,
#                                 verbose = 1)
# grid_xgb.fit(X_train['AE'], y_train['AE'],
#                 eval_set = [(X_train['AE'], y_train['AE']), (X_val['AE'], y_val['AE'])],
#                 verbose = 1000)
# grid_xgb.best_estimator_
In [339]:
# XGBoost on all zones
xgb_val_results = {}
xgb_models = {}

for zone in zones:
    print(zone)
    xgb_val_results[zone] = []
    xgb_models[zone] = []
    # Concat train and val sets for final training.
    X_temp_train_df = pd.concat([X_train[zone], X_val[zone]]).reset_index(drop = True)
    y_temp_train_df = pd.concat([y_train[zone], y_val[zone]]).reset_index(drop = True)
    
    xgb_model = xgb.XGBRegressor(booster = 'gbtree',
                                n_estimators = 6000,
                                max_depth = 7,
                                learning_rate = 0.1,
                                gamma = 0.1,
                                early_stopping_rounds = 100,
                                objective = 'reg:squarederror',
                                eval_metric = 'rmse')
    xgb_model.fit(X_temp_train_df, y_temp_train_df,
                eval_set = [(X_temp_train_df, y_temp_train_df)],
                verbose = 2000)

    xgb_val_results[zone] = xgb_model.best_score
    xgb_models[zone] = xgb_model
AE
[0]	validation_0-rmse:307.48081
[2000]	validation_0-rmse:10.97208
[4000]	validation_0-rmse:4.79145
[5999]	validation_0-rmse:2.44426
CE
[0]	validation_0-rmse:2137.47512
[2000]	validation_0-rmse:107.36722
[4000]	validation_0-rmse:60.90634
[5999]	validation_0-rmse:40.16256
DOM
[0]	validation_0-rmse:2223.40895
[2000]	validation_0-rmse:134.66029
[4000]	validation_0-rmse:76.31948
[5999]	validation_0-rmse:49.97660
JC
[0]	validation_0-rmse:657.91238
[2000]	validation_0-rmse:24.18403
[4000]	validation_0-rmse:10.17122
[5999]	validation_0-rmse:5.05276
PEP
[0]	validation_0-rmse:715.47620
[2000]	validation_0-rmse:53.81193
[4000]	validation_0-rmse:35.40717
[5999]	validation_0-rmse:26.46956
In [340]:
xgb_y_pred = {}
for zone in zones:
    xgb_y_pred[zone] = {}
    for horizon in horizon_selector:
        xgb_y_pred[zone][horizon] = []
        X_test_horizon = generate_test_horizon(zone, horizon, 'X')
        xgb_y_pred[zone][horizon] = xgb_models[zone].predict(X_test_horizon)
In [342]:
xgb.plot_importance(xgb_model)
plt.show()
In [343]:
xgb.plot_tree(xgb_model, num_trees=12)
Out[343]:
<Axes: >
In [344]:
plt.subplots(figsize = (10,4))
sns.lineplot(generate_test_horizon('AE', '2Y', 'y'), c = zones_palette['AE'], alpha = 0.8)
sns.lineplot(xgb_y_pred['AE']['2Y'], c = 'purple', alpha = 0.8)
Out[344]:
<Axes: >
Back to Table of Contents¶

7.4. Deep Learning Models ¶

First we need to prepare the data for use in these models.

Since we won't be able to handle training of the full dataset we will just use a small slice before the Test Set cutoff for training and slice the Test Set for different forecast horizons.

Note: All tuning was done on Train and Val sets.

In [85]:
# MacOS - Set cuda device to mps.
print(torch.backends.mps.is_available())
torch.device('mps')

# Create dataframe in Nixtla/Pytorch format
load_df.reset_index(drop = False, inplace = True)
load_df.rename(columns = {'index' : 'time_idx'}, inplace = True)
# Add 7 days of test data that will be used during .predict()
pytorch_df = load_df.loc[(load_df['Date'] >= pd.to_datetime(train_cutoff)) & 
                         (load_df['Date'] < (pd.to_datetime(test_cutoff) + pd.Timedelta(days = 30)))] # Set horizon here.
# Less to assign later if we just rename these columns.
pytorch_df = pytorch_df.rename(columns = {'Date' : 'ds', 'mw' : 'y', 'zone' : 'unique_id'})

# Build static df to groupby zone in Nixtla.
pytorch_df_static = pd.DataFrame({'unique_id' : zones,
                                     'zones' : zones})
pytorch_df_static = pytorch_df_static.join(pd.get_dummies(pytorch_df_static['zones'], dtype = int))
pytorch_df_static = pytorch_df_static.drop(['zones'], axis = 1)
pytorch_df_static
True
Out[85]:
unique_id AE CE DOM JC PEP
0 AE 1 0 0 0 0
1 CE 0 1 0 0 0
2 DOM 0 0 1 0 0
3 JC 0 0 0 1 0
4 PEP 0 0 0 0 1
In [86]:
# Train: 2021-10-03 00:00:00 to 2021-12-31 23:00:00
pytorch_train = pytorch_df[pytorch_df.ds.between(pd.to_datetime(test_cutoff) - pd.Timedelta(days = 90), test_cutoff, inclusive = 'left')]
# Test: 2022-01-01 00:00:00 to 2022-01-30 23:00:00
pytorch_test = pytorch_df[pytorch_df.ds >= test_cutoff].reset_index(drop = True)

# Find size of input and horizon. Will be same across zones.
input_size_train = int(len(pytorch_train[pytorch_train.unique_id == 'AE']) / 2.75) # 2.75 to give a few periods to train.
h_size_test = len(pytorch_test[pytorch_test.unique_id == 'AE'])

print(input_size_train)
print(h_size_test)
785
720

7.4.1. Neural Hierarchical Interpolation (N-HiTS) ¶

Developed as an improvement over the popular N-BEATS deep-learning model. The model is composed of several multilayer perceptrons (MLPs) with ReLU nonlinearities. Put simply, N-HiTS interpolates the output across blocks which each specializes in separate signal frequencies.

The paper on N-HiTS was released in November, 2022 and is worth looking through (https://arxiv.org/abs/2201.12886).

As with other models in this project, we are limited by the hardware this is run on so this is by no means a full test on this dataset. The training input was reduced to 90 days and forecast limited to 1 month (30*24). Exogenous features were slowing down training too much to be included and were causing the model to stay in a loss saddle.

Note:

Unable to resolve error that causes the kernel to crash when running N-HiTS when importing both statsforecast.models import MSTL, AutoARIMA, HoltWinters AND neuralforecast.models import NHITS, TFT.

For now, training model, saving prediction dataframes to .csv which can then be loaded after commenting out model training code block.

In [87]:
# # Set environment variable since pytorch still lacks some functionality on apple silicon.
# os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = '1'

# nhits_y_pred = {}

# for horizon in horizon_selector:
#       if horizon in ['6M', '2Y']:
#             continue # skipping these horizons for this model.
#       print(f'Currently Training: {horizon}')
#       nhits_y_pred[horizon] = []
#       horizon_freq = int((horizon_selector[horizon].total_seconds() / 3600) + 1)
      
#       models = [NHITS(input_size = input_size_train,
#                   h = horizon_freq,
#                   max_steps = 1000,
#                   learning_rate = 0.001,
#                   n_pool_kernel_size = [2, 2, 1],
#                   n_freq_downsample = [4, 2, 1],
#                   mlp_units = 3 * [[512, 512]],
#                   accelerator = 'cpu',
#                   stat_exog_list = ['AE', 'CE', 'DOM', 'JC', 'PEP']),
#             ]
#       nhits_model = NeuralForecast(models = models, freq = 'H')
#       nhits_model.fit(df = pytorch_train, static_df = pytorch_df_static)
#       # Predict.
#       nhits_y_pred[horizon] = nhits_model.predict(futr_df = pytorch_test.iloc[:, 1:4]).reset_index()
#       # Save model.
#       nhits_model.save(path=current_wdir + f'/Models/NHITS/{horizon}',
#             model_index=None, 
#             overwrite=True,
#             save_dataset=True)

Comment out save as needed.

In [88]:
# # Save predictions to .csv files.
# # See note at section header.
# for horizon in nhits_y_pred.keys():
#     nhits_y_pred[horizon].to_csv(current_wdir + f'/Models/NHITS/{horizon}/{horizon}_y_pred.csv.gz', compression = 'gzip', index = False)

Load previously trained model predictions and reformat to align with other models.

In [89]:
# Load predictions from last model fit.
nhits_y_pred = {}
for horizon in horizon_selector:
    if horizon in ['6M', '2Y']:
            continue # Not in this model.
    file = current_wdir + f'/Models/NHITS/{horizon}/{horizon}_y_pred.csv.gz'
    nhits_y_pred[horizon] = []
    nhits_y_pred[horizon] = pd.read_csv(file, compression = 'gzip')
In [90]:
# Reformat into y_pred dict to match other models.
nhits_y_pred_rfmt = {}
for zone in zones:
    nhits_y_pred_rfmt[zone] = {}
    for horizon in horizon_selector:
        nhits_y_pred_rfmt[zone][horizon] = []
        # Set missing horizon results to N/A so it shows up in results.
        if horizon not in nhits_y_pred.keys():
            horizon_freq = int((horizon_selector[horizon].total_seconds() / 3600) + 1)
            nhits_y_pred_rfmt[zone][horizon] = np.full((horizon_freq),'N/A')
            continue
        nhits_y_pred_rfmt[zone][horizon] = nhits_y_pred[horizon].query(
            'unique_id == @zone')['NHITS'].to_numpy()
Back to Table of Contents¶

7.4.2. Temporal Fusion Transformer ¶

As the name implies, it is based on a transformer multi-headed self-attention architecture that temporal data is fed into. It couples the transformer decoder with inputs of LSTM encoder/decoders for the static covariates. This powerful model is a great example of modern hybrid models using different architectures that specialize to effectively encode and analyze different sections of the feature space.

However, as the theme continues, my non-parallelized machine struggled to keep up with the demands of the TFT architecture. Though the results were promising, I was able to forecast up to 1 week from just ~40 days of training data and removing all exogenous covariates, but it still took 500+ minutes of training. For now this model will be deactivated until this is revisited in the future.

In [91]:
# # Set environment variable since pytorch lacks some functionality on apple silicon still.
# os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = '1'

# tft_model = NeuralForecast(
#     models = [TFT(h = h_size_test,
#                   input_size = input_size_train, # AR input size (t-input_size : t)
#                   hidden_size = 20,
#                   loss = DistributionLoss(distribution = 'StudentT', level = [80, 90]),
#                   learning_rate = 0.005,
#                   stat_exog_list = ['AE', 'CE', 'DOM', 'JC', 'PEP'],
#                   futr_exog_list = ['MaxTemperature', 'MinTemperature', 'AvgTemperature',
#                                     'Precipitation', 'Hour', 'Day',
#                                     'Month', 'Year', 'Day_of_Year',
#                                     'Rolling_Mean_3', 'Rolling_Mean_6', 'Lag_Hour',
#                                     'Lag_6_Hours', 'Lag_Day', 'Lag_Week',
#                                     'Lag_Month', 'Deriv_1', 'Deriv_2',
#                                     'FFT', 'FFT_Freq'],
#                   hist_exog_list = ['time_idx'],
#                   max_steps = 500,
#                   val_check_steps = 10,
#                   early_stop_patience_steps = 10,
#                   scaler_type = 'robust',
#                   windows_batch_size = None,
#                   enable_progress_bar = True,
#                   accelerator = 'cpu', # set to cpu for now, mps not working..
#                   # start_padding_enabled = True
#                   ), 
#                   ],
#             freq = 'H' # Hourly
#             )

# tft_model.fit(df = pytorch_train, static_df = pytorch_df_static, val_size = h_size_test)

# tft_y_pred = tft_model.predict(futr_df = pytorch_test)

# # Plot predictions
# tft_y_pred = tft_y_pred.reset_index(drop = False).drop(columns = ['unique_id', 'ds'])
# for zone in zones:

#     plt.subplots(figsize = (10, 4))
#     plot_df = pd.concat([pytorch_test, tft_y_pred], axis = 1)
#     plot_df = pd.concat([pytorch_train, plot_df])

#     plot_df = plot_df[plot_df.unique_id == zone].drop('unique_id', axis = 1)
#     plt.plot(plot_df['ds'], plot_df['y'], c = zones_palette[zone], label = 'True')
#     plt.plot(plot_df['ds'], plot_df['TFT'], c = 'purple', label = 'Forecast')
#     plt.fill_between(x = plot_df['ds'][-h_size_test: ], 
#                     y1 = plot_df['TFT-lo-90'][-h_size_test: ].values, 
#                     y2 = plot_df['TFT-hi-90'][-h_size_test: ].values,
#                     alpha = 0.2, color = '#82ccfa', label = 'CI 90')
#     plt.title(zone)
#     plt.legend()
#     plt.grid()
#     plt.plot()
In [92]:
# # Save model.
# pytorch_test.to_csv('TFT_Test.csv')
# tft_y_pred.to_csv('TFT_Predict.csv')
# tft_model.save(path=current_wdir + '/Models/TFT/nf2',
#         model_index=None, 
#         overwrite=False,
#         save_dataset=True)
Back to Table of Contents¶

8. Results Evaluation ¶


8.1. Evaluation Helper Functions ¶

In [424]:
# Global for now.
model_selector = {'Naive' : naive,
                  'Mean' : mean,
                  'HoltWinters' : holtwinters_y_pred,
                  'MSTL' : mstl_y_pred_rfmt,
                  'SVM' : svm_y_pred,
                  'XGBoost' : xgb_y_pred,
                  'NHITS' : nhits_y_pred_rfmt
                  #'TFT' : tft_y_pred
                  }
reorderdict_horizon = {'6H':0,'3D':1,'1W':2,'1M':3,'6M':4,'2Y':5}
reorderdict_model = {'Naive':0,'Mean':1,'HoltWinters':2,'MSTL':3,'SVM':4,'XGBoost':5,'NHITS':6}

Symmetric Mean Absolute Percentage Error (sMAPE) will be the evaluation metric used to compare all models. This will ensure comparisons of error across different models and forecasting horizons are normalized.

$\text{sMAPE} =\frac{200}{n}\sum_{i=1}^{n}\frac{\lvert y_i - \hat{y_i} \rvert}{\lvert y_i \rvert + \lvert \hat{y_i} \rvert}$

where $y_i$ and $\hat{y_i}$ are the actual values and forecasted values respectively, and $n$ is the number of points in the set.

One possible limitation of sMAPE is if a predicted value or actual value equals zero, the metric balloons to the maximum error value, and if both equals zero, the metric is undefined. Chosen evaluation metrics often have strengths and weaknesses, and it is yet unclear if this limitation will occur here.

In [425]:
def sMAPE(model, forecast_horizon):
    sMAPE_result = {}
    for zone in zones:     
        sMAPE_result[zone] = []
        y_pred = model_selector[model][zone][forecast_horizon]
        # Check model for missing horizon.
        if y_pred[0] == 'N/A':
            sMAPE_result[zone] = np.nan
            continue
        y_test_horizon = generate_test_horizon(zone, forecast_horizon, 'y')
        sMAPE_result[zone] = (200 / len(y_pred)) * np.sum(np.abs(y_test_horizon-y_pred) / (np.abs(y_test_horizon)+np.abs(y_pred)))

    return sMAPE_result

Root Mean Squared Error (RMSE) will also be used in model evaluation to complement sMAPE. The value of including RMSE is that it is an easily interpretable quantitative metric resulting in a value that is in the same unit as the response. It also heavily penalizes predictions the further they are from the actual value.

${RMSE} =\sqrt\frac{\sum_{i=1}^{n}(y_i - \hat{y_i})^2}{n}$

where $y_i$ and $\hat{y_i}$ are the actual values and forecasted values respectively, and $n$ is the number of points in the set.

In [426]:
def RMSE(model, forecast_horizon):
    RMSE_result = {}
    for zone in zones:     
        RMSE_result[zone] = []
        y_pred = model_selector[model][zone][forecast_horizon]
        y_test_horizon = generate_test_horizon(zone, forecast_horizon, 'y')
        RMSE_result[zone] = np.sqrt(sum((y_test_horizon - y_pred)**2)/len(y_test_horizon))

    return RMSE_result

In conjunction with the RMSE, a Normalized RMSE (NRMSE) will also be used to compare models between zones, since each zone have varying scales of the target variable.

${NRMSE} = \frac{RMSE}{\bar{Y}}$

where $\bar{Y}$ is the mean of the actual values.

In [427]:
def NormRMSE(model, forecast_horizon):
    NormRMSE_result = {}
    for zone in zones:     
        NormRMSE_result[zone] = []
        y_pred = model_selector[model][zone][forecast_horizon]
        # Check model for missing horizon.
        if y_pred[0] == 'N/A':
            NormRMSE_result[zone] = np.nan
            continue
        y_test_horizon = generate_test_horizon(zone, forecast_horizon, 'y')
        NormRMSE_result[zone] = np.sqrt(sum((y_test_horizon - y_pred)**2)/len(y_test_horizon)) / np.mean(y_test_horizon)

    return NormRMSE_result
Back to Table of Contents¶

8.2. Generate Results ¶

In [428]:
# Create NRMSE dataframe.
results_NormRMSE = pd.DataFrame()
for model in model_selector:
    #print(model)
    for horizon in horizon_selector:
        # if (model == 'MSTL' or model == 'NHITS') and horizon in ['6M', '2Y']:
        #     continue # Skip, too long for this model.
        #print(horizon)
        df_index = pd.MultiIndex.from_product([['NormRMSE'],[model],[horizon]], names = ['Metric', 'Model', 'Horizon'])        
        results_NormRMSE = pd.concat([results_NormRMSE, pd.DataFrame(NormRMSE(model, horizon), index = df_index)])

results_NormRMSE = results_NormRMSE.sort_index(level='Model', key = lambda x: x.map(reorderdict_model))
In [429]:
#results_NormRMSE.query('Horizon == @horizon').sort_index(level = 'Model')
In [430]:
results_NormRMSE_dict = {}
for horizon in horizon_selector:
    results_NormRMSE_dict[horizon] = []
    results_NormRMSE_dict[horizon] = results_NormRMSE.query('Horizon == @horizon').style.highlight_min(axis = 0, props = 'background-color:green;', subset = zones)
    #display(results_NormRMSE_dict[horizon])
In [431]:
# Create sMAPE dataframe.
results_sMAPE = pd.DataFrame()
for model in model_selector:
    #print(model)
    for horizon in horizon_selector:
        # if (model == 'MSTL' or model == 'NHITS') and horizon in ['6M', '2Y']:
        #     continue # Skip, too long for this model.
        #print(horizon)
        df_index = pd.MultiIndex.from_product([['sMAPE'],[model],[horizon]], names = ['Metric', 'Model', 'Horizon'])        
        results_sMAPE = pd.concat([results_sMAPE, pd.DataFrame(sMAPE(model, horizon), index = df_index)])

results_sMAPE = results_sMAPE.sort_index(level='Model', key = lambda x: x.map(reorderdict_model))
In [432]:
results_sMAPE_dict = {}
for horizon in horizon_selector:
    results_sMAPE_dict[horizon] = []
    results_sMAPE_dict[horizon] = results_sMAPE.query('Horizon == @horizon').style.highlight_min(axis = 0, props = 'background-color:green;', subset = zones)
    #display(results_sMAPE_dict[horizon])
Back to Table of Contents¶

8.3. Results Tables ¶

In [433]:
#results_final = {}
for horizon in horizon_selector:
    print(f"-------------------------------------- {horizon} --------------------------------------")
    #results_final[horizon] = []
    #results_final[horizon] = pd.concat([results_NormRMSE_dict[horizon], results_sMAPE_dict[horizon]], axis = 0)
    display(results_NormRMSE_dict[horizon])
    display(results_sMAPE_dict[horizon])
-------------------------------------- 6H --------------------------------------
      AE CE DOM JC PEP
Metric Model Horizon          
NormRMSE Naive 6H 0.136277 0.087492 0.150645 0.184597 0.235583
Mean 6H 0.340522 0.314349 0.186621 0.397084 0.488823
HoltWinters 6H 0.014868 0.046767 0.011711 0.020384 0.013049
MSTL 6H 0.010456 0.042240 0.020137 0.026714 0.025732
SVM 6H 0.054108 0.104567 0.076867 0.085008 0.190672
XGBoost 6H 0.045109 0.066064 0.041571 0.035517 0.028640
NHITS 6H 0.028110 0.045254 0.011302 0.017790 0.016293
      AE CE DOM JC PEP
Metric Model Horizon          
sMAPE Naive 6H 12.580831 8.368642 13.676927 16.850913 20.837337
Mean 6H 28.959313 27.093643 16.941791 33.020723 39.234910
HoltWinters 6H 1.281159 4.399481 1.074516 1.976006 1.146197
MSTL 6H 0.924006 4.236191 1.540820 2.242525 1.879716
SVM 6H 4.607263 9.269047 6.653134 7.561048 18.970828
XGBoost 6H 3.375682 6.152227 4.115225 3.328029 2.278010
NHITS 6H 2.248364 4.536223 1.046940 1.470468 1.328477
-------------------------------------- 3D --------------------------------------
      AE CE DOM JC PEP
Metric Model Horizon          
NormRMSE Naive 3D 0.119357 0.109039 0.171870 0.113405 0.200999
Mean 3D 0.208615 0.141610 0.183489 0.219810 0.280960
HoltWinters 3D 0.096301 0.104986 0.128598 0.084460 0.129481
MSTL 3D 0.132737 0.077898 0.111262 0.133788 0.131693
SVM 3D 0.088246 0.058785 0.094406 0.131786 0.118316
XGBoost 3D 0.098207 0.032681 0.082099 0.064133 0.050121
NHITS 3D 0.120104 0.076831 0.141640 0.110050 0.147622
      AE CE DOM JC PEP
Metric Model Horizon          
sMAPE Naive 3D 10.564294 8.141624 14.682704 10.053777 17.745348
Mean 3D 17.931932 11.818520 13.749959 18.790199 24.905756
HoltWinters 3D 6.795567 9.117568 7.209524 6.459062 7.638461
MSTL 3D 10.137742 6.672670 7.716595 11.919681 9.536786
SVM 3D 6.649310 5.024464 7.196546 9.249314 11.662376
XGBoost 3D 7.755542 2.698701 5.322590 4.371418 4.240230
NHITS 3D 9.395042 7.082582 10.380827 9.056158 12.270452
-------------------------------------- 1W --------------------------------------
      AE CE DOM JC PEP
Metric Model Horizon          
NormRMSE Naive 1W 0.114612 0.117962 0.137086 0.092925 0.143650
Mean 1W 0.152446 0.126032 0.221700 0.163112 0.190239
HoltWinters 1W 0.129742 0.162546 0.208207 0.120589 0.212457
MSTL 1W 0.124733 0.131192 0.119044 0.126353 0.148563
SVM 1W 0.070804 0.055620 0.089589 0.095575 0.089091
XGBoost 1W 0.072556 0.045279 0.069561 0.053372 0.052927
NHITS 1W 0.114543 0.159284 0.171549 0.104196 0.172830
      AE CE DOM JC PEP
Metric Model Horizon          
sMAPE Naive 1W 9.833909 10.046109 11.853713 7.963604 12.959489
Mean 1W 13.027409 11.028017 19.665086 13.869585 16.964675
HoltWinters 1W 10.738640 15.046106 17.100893 10.238545 18.103634
MSTL 1W 10.071614 10.309518 9.181221 10.920468 12.810960
SVM 1W 5.717646 4.692355 6.849986 7.107230 8.467719
XGBoost 1W 5.374305 3.589565 5.300548 3.915818 4.300709
NHITS 1W 9.455587 15.526011 13.352938 9.039466 14.380739
-------------------------------------- 1M --------------------------------------
      AE CE DOM JC PEP
Metric Model Horizon          
NormRMSE Naive 1M 0.121239 0.100637 0.161737 0.100371 0.148098
Mean 1M 0.126805 0.101653 0.283969 0.133212 0.162745
HoltWinters 1M 0.167221 0.167445 0.304187 0.176096 0.286739
MSTL 1M 0.149286 0.101509 0.187906 0.226123 0.189383
SVM 1M 0.084019 0.057702 0.087019 0.084813 0.090183
XGBoost 1M 0.067560 0.040742 0.089747 0.068048 0.066061
NHITS 1M 0.159065 0.212156 0.331945 0.136681 0.299784
      AE CE DOM JC PEP
Metric Model Horizon          
sMAPE Naive 1M 10.163846 7.986211 13.466885 8.498115 12.421239
Mean 1M 10.515516 8.513869 28.365910 10.961452 14.056335
HoltWinters 1M 15.198061 16.639998 30.038587 16.547223 28.951588
MSTL 1M 13.202894 8.138927 16.189155 22.238330 17.300937
SVM 1M 6.603718 4.732827 7.115081 6.492116 7.793409
XGBoost 1M 5.180421 3.061406 7.495047 5.063363 5.188263
NHITS 1M 13.960246 21.936399 31.881035 12.069495 29.509419
-------------------------------------- 6M --------------------------------------
      AE CE DOM JC PEP
Metric Model Horizon          
NormRMSE Naive 6M 0.195920 0.194636 0.173055 0.207272 0.188107
Mean 6M 0.240628 0.199582 0.218384 0.231947 0.218194
HoltWinters 6M 0.205409 0.176356 0.221137 0.195561 0.216097
MSTL 6M nan nan nan nan nan
SVM 6M 0.130074 0.093505 0.094867 0.137879 0.129636
XGBoost 6M 0.081284 0.054166 0.071832 0.114838 0.060866
NHITS 6M nan nan nan nan nan
      AE CE DOM JC PEP
Metric Model Horizon          
sMAPE Naive 6M 13.572251 12.176701 13.512928 12.933855 14.080367
Mean 6M 18.941483 15.168616 16.169793 18.350397 18.375323
HoltWinters 6M 15.425375 12.174435 15.976172 14.355150 15.959039
MSTL 6M nan nan nan nan nan
SVM 6M 9.653313 6.016505 7.315464 9.054637 11.440345
XGBoost 6M 6.139266 3.743250 5.560432 6.804394 4.516422
NHITS 6M nan nan nan nan nan
-------------------------------------- 2Y --------------------------------------
      AE CE DOM JC PEP
Metric Model Horizon          
NormRMSE Naive 2Y 0.193376 0.183827 0.169782 0.209542 0.174865
Mean 2Y 0.296964 0.207625 0.233131 0.282705 0.225961
HoltWinters 2Y 0.283207 0.171086 0.231524 0.254468 0.214478
MSTL 2Y nan nan nan nan nan
SVM 2Y 0.167635 0.097154 0.102399 0.157525 0.278998
XGBoost 2Y 0.072943 0.058035 0.078654 0.105739 0.054516
NHITS 2Y nan nan nan nan nan
      AE CE DOM JC PEP
Metric Model Horizon          
sMAPE Naive 2Y 13.257603 12.475873 14.042065 13.456625 12.813002
Mean 2Y 21.589680 16.402006 17.586856 20.421968 18.642190
HoltWinters 2Y 18.524866 11.674018 18.222272 16.527110 15.562225
MSTL 2Y nan nan nan nan nan
SVM 2Y 12.905069 6.720007 8.251229 10.808048 28.784374
XGBoost 2Y 5.588003 4.122110 6.700303 6.550198 4.082743
NHITS 2Y nan nan nan nan nan
In [476]:
fig, ax = plt.subplots(figsize = (6, 10))
sns.heatmap(results_NormRMSE, ax = ax, annot = True, fmt = '.3f')
plt.show()
In [435]:
fig, ax = plt.subplots(figsize = (6, 10))
sns.heatmap(results_sMAPE, ax = ax, annot = True, fmt = '.3f')
plt.show()
Back to Table of Contents¶

8.4. Results Plots ¶

8.4.1. Plot Results by Model and Horizon ¶

In [436]:
models_palette = {'Naive' : '#89a832',
                  'Mean' : 'black',
                  'HoltWinters' : '#3fd1ca',
                  'MSTL' : '#d1c000',
                  'XGBoost' : '#432075',
                  'SVM' : '#d4359f',
                  'NHITS' : '#cc6a31',
                  }
model_selector.keys()
Out[436]:
dict_keys(['Naive', 'Mean', 'HoltWinters', 'MSTL', 'SVM', 'XGBoost', 'NHITS'])
In [500]:
for zone in zones:
    plt.subplots(figsize=(5,4))
    for model in model_selector:
        plot_df = results_NormRMSE[zone].to_frame().query('Model == @model').sort_index(level='Horizon', key = lambda x: x.map(reorderdict_horizon))
        sns.lineplot(x = reorderdict_horizon.keys(), y = plot_df[zone].values, label = model, color = models_palette[model])
        plt.title(f'{zone}')
        plt.xlabel('Forecast Horizon')
        plt.ylabel('Normalized RMSE')
Back to Table of Contents¶

8.4.2. Plot Forecasts vs True Values ¶

In [521]:
#TODO: Clean this section up, too convoluted.
# Plot all zones and horizon predictions.
for zone in zones:
    # Two subplots for different horizon lengths.
    fig, ax = plt.subplots(3, 1, figsize = (11, 4), sharex = True)
    fig2, ax2 = plt.subplots(3, 1, figsize = (11, 5), sharex = False)
    for i, horizon in enumerate(horizon_selector):
        horizon_freq = int((horizon_selector[horizon].total_seconds() / 3600) + 1)
        if i < 3: # Separate shorter forecasts from longer ones.
            for j, model in enumerate(model_selector):   
                plot_df = pd.concat([y_test[zone].iloc[:horizon_freq],
                                    pd.Series(model_selector[model][zone][horizon],
                                            index = y_test[zone].iloc[:horizon_freq].index,
                                            name = 'y_hat')],axis = 1)
                ax[i].plot(plot_df.index, plot_df['y_hat'], c = models_palette[model], alpha = 0.6, label = f'{model} Forecast' if i == 0 else '')
                if j == len(model_selector)-1: # Only need to plot true once. Last so it plots on top.
                    ax[i].plot(plot_df.index, plot_df['mw'], c = zones_palette[zone], alpha = 0.7, ls = (0, (3, 1, 1, 1, 1, 1)), label = 'True' if i == 2 else '')
                ax[i].grid()
                ax[i].set_title(horizon)
        else: # Longer forecast horizons.
            for j, model in enumerate(model_selector):
                if (model == 'MSTL' or model == 'NHITS') and horizon in ['6M', '2Y']:
                    if j == len(model_selector)-1: # Need this here so it plots true in this if case.
                        ax2[i-3].plot(plot_df.index, plot_df['mw'], c = zones_palette[zone], alpha = 0.7, ls = (0, (3, 1, 1, 1, 1, 1)))#, label = 'True' if i-3 == 0 else '')
                    continue # Skip, too long for this model.
                plot_df = pd.concat([y_test[zone].iloc[:horizon_freq],
                                    pd.Series(model_selector[model][zone][horizon],
                                            index = y_test[zone].iloc[:horizon_freq].index,
                                            name = 'y_hat')],axis = 1)
                ax2[i-3].plot(plot_df.index, plot_df['y_hat'], c = models_palette[model], alpha = 0.6, label = f'{model} Forecast' if i-3 == 0 else '')
                if j == len(model_selector)-1: # Only need to plot true once. Last so it plots on top.
                        ax2[i-3].plot(plot_df.index, plot_df['mw'], c = zones_palette[zone], alpha = 0.7, ls = (0, (3, 1, 1, 1, 1, 1)), label = 'True' if i == 3 else '')
                ax2[i-3].grid()
                ax2[i-3].set_title(horizon)
        
    fig.suptitle(f"{zone} - Model Forecasts at each Horizon")
    fig.supylabel('Megawatts (MW)')
    fig.legend(loc = 'upper right', ncol = 2, fontsize = 8.2)
    fig.tight_layout()

    fig2.suptitle(f"{zone} - Model Forecasts at each Horizon")
    fig2.supylabel('Megawatts (MW)')
    fig2.legend(loc = 'upper right', ncol = 2, fontsize = 8.2)
    fig2.tight_layout()

    plt.show()
Back to Table of Contents¶

Appendix A - Online References: ¶

Several resources that helped along the way in no particular order.

  1. Great bookdown text about forecasting basics - https://otexts.com/fpp2/stationarity.html#stationarity
  2. Nice feature engineering list to help get started - https://hackernoon.com/advanced-techniques-for-time-series-data-feature-engineering
  3. Open-source library/framework for good time series forecasting practices - https://nixtlaverse.nixtla.io/
  4. Another good bookdown for time series foundation - https://bookdown.org/josephs_david11/tsReview/a-brief-discussion-of-stationarity.html
  5. Paper proposing MSTL - https://arxiv.org/pdf/2107.13462.pdf
  6. CloudOps time series traces on pre-training, transfer learning, and zero-shot predictions - https://arxiv.org/pdf/2310.05063
  7. A meta-analysis of deep learning techniques in time series forecasting and evaluation techniques - https://neptjournal.com/upload-images/(7)B-4054.pdf
  8. PJM releases their annual load forecast with detailed report here - https://www.pjm.com/-/media/planning/res-adeq/load-forecast/load-forecast-supplement.ashx
Back to Table of Contents¶

END¶


Everything below is not necessary for this project.


Generation Situation¶


Keeping all the energy generation code here until I decide how it fits into this project. Might spin off into its own depending on timing.

In [438]:
data_folder_gen = current_wdir + '/Data/Gen_by_Fuel'
In [439]:
file_path_gen = [f'{data_folder_gen}/{file}' for file in os.listdir(data_folder_gen) if '.csv' in file]
file_path_gen = sorted(file_path_gen)
In [440]:
gen_df = pd.concat([pd.read_csv(file, compression = 'gzip') for file in file_path_gen], join = 'outer', ignore_index = False, axis = 0)
In [441]:
display(gen_df)
display(gen_df.dtypes)
datetime_beginning_utc datetime_beginning_ept fuel_type mw fuel_percentage_of_total is_renewable
0 1/1/2017 4:00:00 AM 12/31/2016 11:00:00 PM Coal 34820 0.41 False
1 1/1/2017 4:00:00 AM 12/31/2016 11:00:00 PM Gas 11169 0.13 False
2 1/1/2017 4:00:00 AM 12/31/2016 11:00:00 PM Hydro 699 0.01 True
3 1/1/2017 4:00:00 AM 12/31/2016 11:00:00 PM Multiple Fuels 266 0.00 False
4 1/1/2017 4:00:00 AM 12/31/2016 11:00:00 PM Nuclear 34269 0.41 False
... ... ... ... ... ... ...
21095 1/1/2024 5:00:00 AM 1/1/2024 12:00:00 AM Oil 221 0.00 False
21096 1/1/2024 5:00:00 AM 1/1/2024 12:00:00 AM Other Renewables 690 0.01 True
21097 1/1/2024 5:00:00 AM 1/1/2024 12:00:00 AM Solar 12 0.00 True
21098 1/1/2024 5:00:00 AM 1/1/2024 12:00:00 AM Storage 0 0.00 False
21099 1/1/2024 5:00:00 AM 1/1/2024 12:00:00 AM Wind 4004 0.04 True

783663 rows × 6 columns

datetime_beginning_utc       object
datetime_beginning_ept       object
fuel_type                    object
mw                            int64
fuel_percentage_of_total    float64
is_renewable                   bool
dtype: object
In [442]:
gen_df[date_cols] = gen_df[date_cols].apply(pd.to_datetime, format = '%m/%d/%Y %I:%M:%S %p', utc = False)
In [443]:
gen_fuel_df = gen_df[['datetime_beginning_ept', 'fuel_type', 'mw']].groupby(['datetime_beginning_ept', 'fuel_type'], as_index = False).sum()
In [444]:
palette = ['#570408', '#005d5d', '#1192e8', '#9f1853', '#ee538b', '#012749', '#009d9a', '#002d9c', '#fa4d56', '#198038', '#b28600', '#6929c4']

ax, fig = plt.subplots(figsize = (14, 4))
ax = sns.lineplot(gen_fuel_df, x = 'datetime_beginning_ept', y = 'mw',  hue = 'fuel_type', alpha = 0.7, lw = 1, palette = sns.color_palette(palette, 12))
ax.margins(x = 0)
ax.set_xlabel('Time')
ax.set_ylabel('Megawatts (MW)')
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles = handles, labels = labels)
plt.legend(loc = 'upper left')
plt.show()
In [445]:
ax, fig = plt.subplots(figsize = (14, 8))
ax = sns.lineplot(gen_fuel_df.loc[gen_fuel_df['fuel_type'] == 'Coal'], x = 'datetime_beginning_ept', y = 'mw', lw = 1)
In [446]:
ax, fig = plt.subplots(figsize = (14, 8))
ax = sns.lineplot(gen_fuel_df.loc[gen_fuel_df['fuel_type'] == 'Gas'], x = 'datetime_beginning_ept', y = 'mw', lw = 1)
In [447]:
ax, fig = plt.subplots(figsize = (14, 8))
ax = sns.lineplot(gen_fuel_df.loc[gen_fuel_df['fuel_type'] == 'Solar'], x = 'datetime_beginning_ept', y = 'mw', lw = 1)

Seeking the Prophet¶


Temporarily building a model from Meta's Prophet to use in a few figures.

In [448]:
# df = pd.DataFrame({'ds' : gen_fuel_df.loc[gen_fuel_df['fuel_type'] == 'Gas']['datetime_beginning_ept'],
#                    'y' : gen_fuel_df.loc[gen_fuel_df['fuel_type'] == 'Gas']['mw']})
# m = Prophet()
# m.fit(df)

# future = m.make_future_dataframe(periods=365*2)
# display(future.tail())
# forecast = m.predict(future)
# display(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail())
# fig1 = m.plot(forecast)
# ax = fig1.gca()
# plt.axvline(max(df['ds']), color = 'purple', ls = '--')
In [449]:
# # Sourced from the prophet repository and repurposed to include:
# # - Delineate observed data and forecasted data (dashed line)
# # - Separate observed data and forecasted data for different visuals
# # - Color changes
# # - Zoom into the last few years to better visualize forecast (currently hardcoded)

# def prophetplot(m, fcst, ax=None, uncertainty=True, plot_cap=True, xlabel='Time', ylabel='Megawatts (MW)',
#     figsize=(10, 6), include_legend=False):
#     """Plot the Prophet forecast.

#     Parameters
#     ----------
#     m: Prophet model.
#     fcst: pd.DataFrame output of m.predict.
#     ax: Optional matplotlib axes on which to plot.
#     uncertainty: Optional boolean to plot uncertainty intervals, which will
#         only be done if m.uncertainty_samples > 0.
#     plot_cap: Optional boolean indicating if the capacity should be shown
#         in the figure, if available.
#     xlabel: Optional label name on X-axis
#     ylabel: Optional label name on Y-axis
#     figsize: Optional tuple width, height in inches.
#     include_legend: Optional boolean to add legend to the plot.
#     Returns
#     -------
#     A matplotlib figure.
#     """
#     user_provided_ax = False if ax is None else True
#     if ax is None:
#         fig = plt.figure(facecolor='w', figsize=figsize)
#         ax = fig.add_subplot(111)
#     else:
#         fig = ax.get_figure()
#     fcst_t = fcst['ds']#.dt.to_pydatetime()

#     # Initialize the observation and forecast data variables
#     # Finds and separates the two sets by using the max date in (observed) input data.
#     obs_t = fcst_t.loc[fcst_t < max(df['ds'])]
#     obs_yhat = fcst['yhat'].iloc[ : fcst['ds'].loc[fcst['ds'] <= max(df['ds'])].index[-1]]
#     obs_lower = fcst['yhat_lower'].iloc[ : fcst['ds'].loc[fcst['ds'] <= max(df['ds'])].index[-1]]
#     obs_upper = fcst['yhat_upper'].iloc[ : fcst['ds'].loc[fcst['ds'] <= max(df['ds'])].index[-1]]
#     fc_t = fcst_t.loc[fcst_t >= max(df['ds'])]
#     fc_yhat = fcst['yhat'].iloc[fcst['ds'].loc[fcst['ds'] >= max(df['ds'])].index[0]:]
#     fc_lower = fcst['yhat_lower'].iloc[fcst['ds'].loc[fcst['ds'] >= max(df['ds'])].index[0]:]
#     fc_upper = fcst['yhat_upper'].iloc[fcst['ds'].loc[fcst['ds'] >= max(df['ds'])].index[0]:]

#     ax.plot(m.history['ds'].dt.to_pydatetime(), m.history['y'], '.', c = '#012749', markersize = 1, alpha = 0.9,
#             label='Observed data points')
#     ax.plot(obs_t, obs_yhat, ls='-', c='#005d5d', label='Forecast', lw = 0.75, alpha = 0.9)
#     ax.plot(fc_t, fc_yhat, ls='-', c='#382238', label='Predict', lw = 0.75, alpha = 1)
#     if 'cap' in fcst and plot_cap:
#         ax.plot(fcst_t, fcst['cap'], ls='--', c='k', label='Maximum capacity')
#     if m.logistic_floor and 'floor' in fcst and plot_cap:
#         ax.plot(fcst_t, fcst['floor'], ls='--', c='k', label='Minimum capacity')
#     if uncertainty and m.uncertainty_samples:
#         ax.fill_between(obs_t, obs_lower, obs_upper,
#                         color='#009d9a', alpha=0.4, label='Uncertainty interval')
#         ax.fill_between(fc_t, fc_lower, fc_upper,
#                         color='#753f75', alpha=0.4, label='Uncertainty interval')
#     # Specify formatting to workaround matplotlib issue #12925
#     locator = AutoDateLocator(interval_multiples=False)
#     formatter = AutoDateFormatter(locator)
#     ax.xaxis.set_major_locator(locator)
#     ax.xaxis.set_major_formatter(formatter)
#     ax.grid(True, which='major', c='gray', ls='-', lw=0.75, alpha=0.2)
#     ax.set_xlabel(xlabel)
#     ax.set_ylabel(ylabel)

#     # Zoom in x-axis closer to prediction.
#     ax.set_xlim(pd.to_datetime(['2020-01-01 00:00:00', '2026-01-01 00:00:00']))
#     ax.set_ylim(0, 90000)

#     # Prediction divider line.
#     plt.axvline(max(df['ds']), color = 'black', ls = '--', lw = 1.2)

#     # Linear trend line to calculate yearly increase.
#     obs_poly_fn = np.poly1d(np.polyfit(range(len(obs_t)), obs_yhat, 1))
#     plt.plot(obs_t, obs_poly_fn(range(len(obs_t))), c = '#913529')

#     fc_poly_fn = np.poly1d(np.polyfit(range(len(fc_t)), fc_yhat, 1))
#     plt.plot(fc_t, fc_poly_fn(range(len(fc_t))), c = '#252f9c')

#     if include_legend:
#         ax.legend()
#         plt.legend(loc = 'upper left')
#     if not user_provided_ax:
#         fig.tight_layout()
#     return fig
In [450]:
# prophetplot(m, forecast)
# plt.show()